-
Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models
Authors:
Dongwon Jo,
Taesu Kim,
Yulhwa Kim,
Jae-Joon Kim
Abstract:
Binarization, which converts weight parameters to binary values, has emerged as an effective strategy to reduce the size of large language models (LLMs). However, typical binarization techniques significantly diminish linguistic effectiveness of LLMs. To address this issue, we introduce a novel binarization technique called Mixture of Scales (BinaryMoS). Unlike conventional methods, BinaryMoS employs multiple scaling experts for binary weights, dynamically merging these experts for each token to adaptively generate scaling factors. This token-adaptive approach boosts the representational power of binarized LLMs by enabling contextual adjustments to the values of binary weights. Moreover, because this adaptive process only involves the scaling factors rather than the entire weight matrix, BinaryMoS maintains compression efficiency similar to traditional static binarization methods. Our experimental results reveal that BinaryMoS surpasses conventional binarization techniques in various natural language processing tasks and even outperforms 2-bit quantization methods, all while maintaining similar model size to static binarization techniques.
Submitted 18 June, 2024;
originally announced June 2024.
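The token-adaptive scaling idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax router, the per-output-channel scale vectors, and every name here (`binary_mos_linear`, `router_w`, `scale_experts`) are assumptions made only for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a small list."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def binary_mos_linear(
    x: List[float],
    sign_w: List[List[int]],           # binary weights, entries in {-1, +1}
    scale_experts: List[List[float]],  # one per-output scale vector per expert (assumed layout)
    router_w: List[List[float]],       # hypothetical linear router: one row per expert
) -> List[float]:
    # Gate the experts using the current token only.
    logits = [sum(r * xi for r, xi in zip(row, x)) for row in router_w]
    gates = softmax(logits)
    n_out = len(sign_w)
    # Merge the experts into a single token-specific scale per output channel.
    scale = [
        sum(g * expert[o] for g, expert in zip(gates, scale_experts))
        for o in range(n_out)
    ]
    # Apply the merged scale to the binary matrix-vector product.
    return [
        scale[o] * sum(s * xi for s, xi in zip(sign_w[o], x))
        for o in range(n_out)
    ]
```

Only the small scale vectors and the router depend on the token; the binary matrix itself is shared, which is why the abstract can claim a model size close to static binarization.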
-
Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference
Authors:
Donghyeon Joo,
Ramyad Hadidi,
Soheil Feizi,
Bahar Asgari
Abstract:
The increasing size of large language models (LLMs) challenges their usage on resource-constrained platforms. For example, memory on modern GPUs is insufficient to hold LLMs that are hundreds of gigabytes in size. Offloading is a popular method to escape this constraint by storing the weights of an LLM in host CPU memory and SSD, then loading each weight to the GPU before every use. In our case study of offloaded inference, we found that due to the low bandwidth between storage devices and the GPU, the latency of transferring large model weights from their offloaded location to GPU memory becomes the critical bottleneck, with actual compute taking nearly 0% of runtime. To effectively reduce the weight transfer latency, we propose Endor, a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights to non-zero values with a high compression ratio and low decompression overhead. Endor achieves this by expressing the positions of non-zero elements with a bitmap. Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x. When direct weight transfer from SSD to GPU is leveraged, Endor achieves 2.25x speedup on OPT-66B and 2.37x speedup on Llama2-70B.
Submitted 17 June, 2024;
originally announced June 2024.
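The bitmap idea described in the abstract above can be sketched as follows. This is a generic bitmap-plus-values encoding under assumed conventions (least-significant-bit-first packing, a flat weight list); the function names `compress` and `decompress` are hypothetical and the actual Endor layout may differ.

```python
from typing import List, Tuple


def compress(weights: List[float]) -> Tuple[bytes, List[float]]:
    """Return (bitmap, values): one bit per position, plus the non-zero values in order."""
    bitmap = bytearray((len(weights) + 7) // 8)
    values: List[float] = []
    for i, w in enumerate(weights):
        if w != 0.0:
            bitmap[i // 8] |= 1 << (i % 8)  # mark position i as non-zero
            values.append(w)
    return bytes(bitmap), values


def decompress(bitmap: bytes, values: List[float], n: int) -> List[float]:
    """Rebuild the dense length-n list from the bitmap and the packed values."""
    out = [0.0] * n
    it = iter(values)
    for i in range(n):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            out[i] = next(it)
    return out
```

With this encoding, each position costs one bit regardless of sparsity, and only the non-zero values are transferred alongside the bitmap.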
-
Active Learning for Finely-Categorized Image-Text Retrieval by Selecting Hard Negative Unpaired Samples
Authors:
Dae Ung Jo,
Kyuewang Lee,
JaeHo Chung,
Jin Young Choi
Abstract:
Securing a sufficient amount of paired data is important to train an image-text retrieval (ITR) model, but collecting paired data is very expensive. To address this issue, in this paper, we propose an active learning algorithm for ITR that can collect paired data cost-efficiently. Previous studies assume that image-text pairs are given and the annotator is asked for their category labels. However, in recent ITR studies, the importance of category labels has decreased since a retrieval model can be trained with only image-text pairs. For this reason, we set up an active learning scenario where unpaired images (or texts) are given and the annotator provides corresponding texts (or images) to make paired data. The key idea of the proposed AL algorithm is to select unpaired images (or texts) that can be hard negative samples for existing texts (or images). To this end, we introduce a novel scoring function to choose hard negative samples. We validate the effectiveness of the proposed method on the Flickr30K and MS-COCO datasets.
Submitted 25 May, 2024;
originally announced May 2024.
-
Hexa: Self-Improving for Knowledge-Grounded Dialogue System
Authors:
Daejin Jo,
Daniel Wontae Nam,
Gunsoo Han,
Kyoung-Woon On,
Taehwan Kwon,
Seungeun Rho,
Sungwoong Kim
Abstract:
A common practice in knowledge-grounded dialogue generation is to explicitly utilize intermediate steps (e.g., web-search, memory retrieval) with modular approaches. However, data for such steps are often inaccessible compared to those of dialogue responses as they are unobservable in an ordinary dialogue. To fill in the absence of these data, we develop a self-improving method to improve the generative performances of intermediate steps without the ground truth data. In particular, we propose a novel bootstrapping scheme with a guided prompt and a modified loss function to enhance the diversity of appropriate self-generated responses. Through experiments on various benchmark datasets, we empirically demonstrate that our method successfully leverages a self-improving mechanism in generating intermediate and final responses and improves the performances on the task of knowledge-grounded dialogue generation.
Submitted 2 April, 2024; v1 submitted 10 October, 2023;
originally announced October 2023.
-
Squeezing Large-Scale Diffusion Models for Mobile
Authors:
Jiwoong Choi,
Minkyu Kim,
Daehyun Ahn,
Taesu Kim,
Yulhwa Kim,
Dongwon Jo,
Hyesung Jeon,
Jae-Joon Kim,
Hyungjun Kim
Abstract:
The emergence of diffusion models has greatly broadened the scope of high-fidelity image synthesis, resulting in notable advancements in both practical implementation and academic research. With the active adoption of the model in various real-world applications, the need for on-device deployment has grown considerably. However, deploying large diffusion models such as Stable Diffusion with more than one billion parameters to mobile devices poses distinctive challenges due to the limited computational and memory resources, which may vary according to the device. In this paper, we present the challenges and solutions for deploying Stable Diffusion on mobile devices with the TensorFlow Lite framework, which supports both iOS and Android devices. The resulting Mobile Stable Diffusion achieves an inference latency of less than 7 seconds for 512x512 image generation on Android devices with mobile GPUs.
Submitted 3 July, 2023;
originally announced July 2023.
-
Effortless Integration of Memory Management into Open-Domain Conversation Systems
Authors:
Eunbi Choi,
Kyoung-Woon On,
Gunsoo Han,
Sungwoong Kim,
Daniel Wontae Nam,
Daejin Jo,
Seung Eun Rho,
Taehwan Kwon,
Minjoon Seo
Abstract:
Open-domain conversation systems integrate multiple conversation skills into a single system through a modular approach. One of the limitations of such systems, however, is the absence of management capability for external memory. In this paper, we propose a simple method to improve BlenderBot3 by integrating memory management ability into it. Since no training data exists for this purpose, we propose an automated dataset creation method for memory management. Our method 1) requires little cost for data construction, 2) does not affect performance in other tasks, and 3) reduces external memory. We show that our proposed model BlenderBot3-M^3, which is multi-task trained with memory management, outperforms BlenderBot3 with a relative 4% performance gain in terms of F1 score.
Submitted 23 May, 2023;
originally announced May 2023.
-
MAGVLT: Masked Generative Vision-and-Language Transformer
Authors:
Sungwoong Kim,
Daejin Jo,
Donghoon Lee,
Jongmin Kim
Abstract:
While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data by a single model rather than a generation of one fixed modality conditioned on the other modality. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. Especially, we propose a generative VL transformer based on the non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). In comparison to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding by parallel token predictions in an iterative refinement, and extended editing capabilities such as image and text infilling. For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on the step-unrolled mask prediction and the selective prediction on the mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that our MAGVLT outperforms ARGVLT by a large margin even with significant inference speedup. Particularly, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation tasks from MS-COCO by one moderate-sized model (fewer than 500M parameters) even without the use of monomodal data and networks.
Submitted 21 March, 2023;
originally announced March 2023.
-
LECO: Learnable Episodic Count for Task-Specific Intrinsic Reward
Authors:
Daejin Jo,
Sungwoong Kim,
Daniel Wontae Nam,
Taehwan Kwon,
Seungeun Rho,
Jongmin Kim,
Donghoon Lee
Abstract:
Episodic count has been widely used to design a simple yet effective intrinsic motivation for reinforcement learning with a sparse reward. However, the use of episodic count in a high-dimensional state space as well as over a long episode time requires thorough state compression and fast hashing, which hinders rigorous exploitation of it in such hard and complex exploration environments. Moreover, the interference from task-irrelevant observations in the episodic count may cause its intrinsic motivation to overlook important task-related changes of states, and the novelty in an episodic manner can lead to repeatedly revisiting familiar states across episodes. In order to resolve these issues, in this paper, we propose a learnable hash-based episodic count, which we name LECO, that efficiently performs as a task-specific intrinsic reward in hard exploration problems. In particular, the proposed intrinsic reward consists of the episodic novelty and the task-specific modulation, where the former employs a vector quantized variational autoencoder to automatically obtain the discrete state codes for fast counting while the latter regulates the episodic novelty by learning a modulator to optimize the task-specific extrinsic reward. The proposed LECO specifically enables the automatic transition from exploration to exploitation during reinforcement learning. We experimentally show that, in contrast to the previous exploration methods, LECO successfully solves hard exploration problems and also scales to large state spaces through the most difficult tasks in MiniGrid and DMLab environments.
Submitted 11 October, 2022;
originally announced October 2022.
-
Selective Token Generation for Few-shot Natural Language Generation
Authors:
Daejin Jo,
Taehwan Kwon,
Eun-Sol Kim,
Sungwoong Kim
Abstract:
Natural language modeling with limited training data is a challenging problem, and many algorithms make use of large-scale pretrained language models (PLMs) for this due to their great generalization ability. Among them, additive learning that incorporates a task-specific adapter on top of the fixed large-scale PLM has been popularly used in the few-shot setting. However, this added adapter can still easily disregard the knowledge of the PLM, especially for few-shot natural language generation (NLG), since an entire sequence is usually generated by only the newly trained adapter. Therefore, in this work, we develop a novel additive learning algorithm based on reinforcement learning (RL) that selectively outputs language tokens between the task-general PLM and the task-specific adapter during both training and inference. This output token selection over the two generators allows the adapter to take into account solely the task-relevant parts in sequence generation, and therefore makes it more robust to overfitting as well as more stable in RL training. In addition, to obtain the complementary adapter from the PLM for each few-shot task, we exploit a separate selecting module that is also simultaneously trained using RL. Experimental results on various few-shot NLG tasks including question answering, data-to-text generation and text summarization demonstrate that the proposed selective token generation significantly outperforms the previous additive learning algorithms based on the PLMs.
Submitted 16 September, 2022;
originally announced September 2022.
-
The PWLR Graph Representation: A Persistent Weisfeiler-Lehman scheme with Random Walks for Graph Classification
Authors:
Sun Woo Park,
Yun Young Choi,
Dosang Joe,
U Jin Choi,
Youngho Woo
Abstract:
This paper presents the Persistent Weisfeiler-Lehman Random walk scheme (abbreviated as PWLR) for graph representations, a novel mathematical framework which produces a collection of explainable low-dimensional representations of graphs with discrete and continuous node features. The proposed scheme effectively incorporates normalized Weisfeiler-Lehman procedure, random walks on graphs, and persistent homology. We thereby integrate three distinct properties of graphs, which are local topological features, node degrees, and global topological invariants, while preserving stability from graph perturbations. This generalizes many variants of Weisfeiler-Lehman procedures, which are primarily used to embed graphs with discrete node labels. Empirical results suggest that these representations can be efficiently utilized to produce comparable results to state-of-the-art techniques in classifying graphs with discrete node labels, and enhanced performances in classifying those with continuous node features.
Submitted 29 August, 2022;
originally announced August 2022.
-
Insights From the NeurIPS 2021 NetHack Challenge
Authors:
Eric Hambro,
Sharada Mohanty,
Dmitrii Babaev,
Minwoo Byeon,
Dipam Chakraborty,
Edward Grefenstette,
Minqi Jiang,
Daejin Jo,
Anssi Kanervisto,
Jongmin Kim,
Sungwoong Kim,
Robert Kirk,
Vitaly Kurin,
Heinrich Küttler,
Taehwon Kwon,
Donghoon Lee,
Vegard Mella,
Nantas Nardelli,
Ivan Nazarov,
Nikita Ovsov,
Jack Parker-Holder,
Roberta Raileanu,
Karolis Ramanauskas,
Tim Rocktäschel,
Danielle Rothermel
, et al. (4 additional authors not shown)
Abstract:
In this report, we summarize the takeaways from the first NeurIPS 2021 NetHack Challenge. Participants were tasked with developing a program or agent that can win (i.e., 'ascend' in) the popular dungeon-crawler game of NetHack by interacting with the NetHack Learning Environment (NLE), a scalable, procedurally generated, and challenging Gym environment for reinforcement learning (RL). The challenge showcased community-driven progress in AI with many diverse approaches significantly beating the previously best results on NetHack. Furthermore, it served as a direct comparison between neural (e.g., deep RL) and symbolic AI, as well as hybrid systems, demonstrating that on NetHack symbolic bots currently outperform deep RL by a large margin. Lastly, no agent got close to winning the game, illustrating NetHack's suitability as a long-term benchmark for AI research.
Submitted 22 March, 2022;
originally announced March 2022.
-
Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth
Authors:
Doyeon Kim,
Woonghyun Ka,
Pyungwhan Ahn,
Donggyu Joo,
Sehwan Chun,
Junmo Kim
Abstract:
Depth estimation from a single image is an important task that can be applied to various fields in computer vision, and has grown rapidly with the development of convolutional neural networks. In this paper, we propose a novel structure and training strategy for monocular depth estimation to further improve the prediction accuracy of the network. We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder to generate an estimated depth map while considering local connectivity. By constructing connected paths between multi-scale local features and the global decoding stream with our proposed selective feature fusion module, the network can integrate both representations and recover fine details. In addition, the proposed decoder shows better performance than the previously proposed decoders, with considerably less computational complexity. Furthermore, we improve the depth-specific augmentation method by utilizing an important observation in depth estimation to enhance the model. Our network achieves state-of-the-art performance over the challenging depth dataset NYU Depth V2. Extensive experiments have been conducted to validate and show the effectiveness of the proposed approach. Finally, our model shows better generalisation ability and robustness than other comparative models.
Submitted 29 October, 2022; v1 submitted 19 January, 2022;
originally announced January 2022.
-
Beyond 5G URLLC Evolution: New Service Modes and Practical Considerations
Authors:
Hirley Alves,
Gweon Do Jo,
JaeSheung Shin,
Choongil Yeh,
Nurul Huda Mahmood,
Carlos Lima,
Chanho Yoon,
Nandana Rahatheva,
Ok-Sun Park,
Seokki Kim,
Eunah Kim,
Ville Niemelä,
Hyeon Woo Lee,
Ari Pouttu,
Hyun Kyu Chung,
Matti Latva-aho
Abstract:
Ultra-reliable low latency communications (URLLC) arose to serve industrial IoT (IIoT) use cases within 5G. Currently, it has inherent limitations in supporting future services. Based on state-of-the-art research and practical deployment experience, in this article, we introduce and advocate for three variants: broadband, scalable and extreme URLLC. We discuss use cases and key performance indicators and identify technology enablers for the new service modes. We bring practical considerations from the IIoT testbed and provide an outlook toward some new research directions.
Submitted 16 June, 2022; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Influential Rank: A New Perspective of Post-training for Robust Model against Noisy Labels
Authors:
Seulki Park,
Hwanjun Song,
Daeho Um,
Dae Ung Jo,
Sangdoo Yun,
Jin Young Choi
Abstract:
Deep neural networks can easily overfit to even noisy labels due to their high capacity, which degrades the generalization performance of a model. To overcome this issue, we propose a new approach for learning from noisy labels (LNL) via post-training, which can significantly improve the generalization performance of any pre-trained model on noisy label data. To this end, we exploit the overfitting property of a trained model to identify mislabeled samples. Specifically, our post-training approach gradually removes samples with high influence on the decision boundary and refines the decision boundary to improve generalization performance. Our post-training approach creates great synergies when combined with the existing LNL methods. Experimental results on various real-world and synthetic benchmark datasets demonstrate the validity of our approach in diverse realistic scenarios.
Submitted 19 April, 2023; v1 submitted 14 June, 2021;
originally announced June 2021.
-
TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator
Authors:
Doyeon Kim,
Donggyu Joo,
Junmo Kim
Abstract:
Advances in technology have led to the development of methods that can create desired visual multimedia. In particular, image generation using deep learning has been extensively studied across diverse fields. In comparison, video generation, especially on conditional inputs, remains a challenging and less explored area. To narrow this gap, we aim to train our model to produce a video corresponding to a given text description. We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video. In the first phase, we focus on creating a high-quality single video frame while learning the relationship between the text and an image. As the steps proceed, our model is trained gradually on a larger number of consecutive frames. This step-by-step learning process helps stabilize the training and enables the creation of high-resolution video based on conditional text descriptions. Qualitative and quantitative experimental results on various datasets demonstrate the effectiveness of the proposed method.
Submitted 27 June, 2021; v1 submitted 4 September, 2020;
originally announced September 2020.
-
Class-Attentive Diffusion Network for Semi-Supervised Classification
Authors:
Jongin Lim,
Daeho Um,
Hyung Jin Chang,
Dae Ung Jo,
Jin Young Choi
Abstract:
Recently, graph neural networks for semi-supervised classification have been widely studied. However, existing methods only use the information of limited neighbors and do not deal with the inter-class connections in graphs. In this paper, we propose Adaptive aggregation with Class-Attentive Diffusion (AdaCAD), a new aggregation scheme that adaptively aggregates nodes probably of the same class among K-hop neighbors. To this end, we first propose a novel stochastic process, called Class-Attentive Diffusion (CAD), that strengthens attention to intra-class nodes and attenuates attention to inter-class nodes. In contrast to the existing diffusion methods with a transition matrix determined solely by the graph structure, CAD considers both the node features and the graph structure with the design of our class-attentive transition matrix that utilizes a classifier. Then, we further propose an adaptive update scheme that leverages different reflection ratios of the diffusion result for each node depending on the local class-context. As the main advantage, AdaCAD alleviates the problem of undesired mixing of inter-class features caused by discrepancies between node labels and the graph topology. Built on AdaCAD, we construct a simple model called Class-Attentive Diffusion Network (CAD-Net). Extensive experiments on seven benchmark datasets consistently demonstrate the efficacy of the proposed method and our CAD-Net significantly outperforms the state-of-the-art methods. Code is available at https://github.com/ljin0429/CAD-Net.
Submitted 29 December, 2020; v1 submitted 17 June, 2020;
originally announced June 2020.
-
Streaming Language Identification using Combination of Acoustic Representations and ASR Hypotheses
Authors:
Chander Chandak,
Zeynab Raeesy,
Ariya Rastrow,
Yuzong Liu,
Xiangyang Huang,
Siyu Wang,
Dong Kwon Joo,
Roland Maas
Abstract:
This paper presents our modeling and architecture approaches for building a highly accurate low-latency language identification system to support multilingual spoken queries for voice assistants. A common approach to solve multilingual speech recognition is to run multiple monolingual ASR systems in parallel and rely on a language identification (LID) component that detects the input language. Conventionally, LID relies on acoustic only information to detect input language. We propose an approach that learns and combines acoustic level representations with embeddings estimated on ASR hypotheses resulting in up to 50% relative reduction of identification error rate, compared to a model that uses acoustic only features. Furthermore, to reduce the processing cost and latency, we exploit a streaming architecture to identify the spoken language early when the system reaches a predetermined confidence level, alleviating the need to run multiple ASR systems until the end of input query. The combined acoustic and text LID, coupled with our proposed streaming runtime architecture, results in an average of 1500ms early identification for more than 50% of utterances, with almost no degradation in accuracy. We also show improved results by adopting a semi-supervised learning (SSL) technique using the newly proposed model architecture as a teacher model.
Submitted 1 June, 2020;
originally announced June 2020.
-
Token Manipulation Generative Adversarial Network for Text Generation
Authors:
DaeJin Jo
Abstract:
MaskGAN opens the query for the conditional language model by filling in the blanks between the given tokens. In this paper, we focus on addressing the limitations caused by having to specify blanks to be filled. We decompose the conditional text generation problem into two tasks, make-a-blank and fill-in-the-blank, and extend the former to handle more complex manipulations on the given tokens. We cast these tasks as a hierarchical multi-agent RL problem and introduce a conditional adversarial learning scheme that allows the agents to reach a goal, producing realistic texts, in a cooperative setting. We show that the proposed model not only addresses the limitations but also provides good results without compromising the performance in terms of quality and diversity.
Submitted 11 May, 2020; v1 submitted 6 May, 2020;
originally announced May 2020.
-
Continual Learning with Extended Kronecker-factored Approximate Curvature
Authors:
Janghyeon Lee,
Hyeong Gwon Hong,
Donggyu Joo,
Junmo Kim
Abstract:
We propose a quadratic penalty method for continual learning of neural networks that contain batch normalization (BN) layers. The Hessian of a loss function represents the curvature of the quadratic penalty function, and a Kronecker-factored approximate curvature (K-FAC) is used widely to practically compute the Hessian of a neural network. However, the approximation is not valid if there is dependence between examples, typically caused by BN layers in deep network architectures. We extend the K-FAC method so that the inter-example relations are taken into account and the Hessian of deep neural networks can be properly approximated under practical assumptions. We also propose a method of weight merging and reparameterization to properly handle statistical parameters of BN, which plays a critical role for continual learning with BN, and a method that selects hyperparameters without source task data. Our method shows better performance than baselines in the permuted MNIST task with BN layers and in sequential learning from the ImageNet classification task to fine-grained classification tasks with ResNet-50, without any explicit or implicit use of source task data for hyperparameter selection.
Submitted 16 April, 2020;
originally announced April 2020.
-
Residual Continual Learning
Authors:
Janghyeon Lee,
Donggyu Joo,
Hyeong Gwon Hong,
Junmo Kim
Abstract:
We propose a novel continual learning method called Residual Continual Learning (ResCL). Our method can prevent the catastrophic forgetting phenomenon in sequential learning of multiple tasks, without any source task information except the original network. ResCL reparameterizes network parameters by linearly combining each layer of the original network and a fine-tuned network; therefore, the size of the network does not increase at all. To apply the proposed method to general convolutional neural networks, the effects of batch normalization layers are also considered. By utilizing residual-learning-like reparameterization and a special weight decay loss, the trade-off between source and target performance is effectively controlled. The proposed method exhibits state-of-the-art performance in various continual learning scenarios.
Submitted 17 February, 2020;
originally announced February 2020.
-
Cross-modal Variational Auto-encoder with Distributed Latent Spaces and Associators
Authors:
Dae Ung Jo,
ByeongJu Lee,
Jongwon Choi,
Haanju Yoo,
Jin Young Choi
Abstract:
In this paper, we propose a novel structure for cross-modal data association, inspired by recent research on the associative learning structure of the brain. We formulate the cross-modal association in a Bayesian inference framework realized by a deep neural network with multiple variational auto-encoders and variational associators. The variational associators transfer the latent spaces between auto-encoders that represent different modalities. The proposed structure successfully associates even heterogeneous modal data and easily incorporates an additional modality into the entire network via the proposed cross-modal associator. Furthermore, the proposed structure can be trained with only a small amount of paired data, since the auto-encoders can be trained in an unsupervised manner. Through experiments, the effectiveness of the proposed structure is validated on various datasets including visual and auditory data.
Submitted 30 May, 2019;
originally announced May 2019.
-
Backbone Can Not be Trained at Once: Rolling Back to Pre-trained Network for Person Re-Identification
Authors:
Youngmin Ro,
Jongwon Choi,
Dae Ung Jo,
Byeongho Heo,
Jongin Lim,
Jin Young Choi
Abstract:
In the person re-identification (ReID) task, because trainable datasets are scarce, it is common to fine-tune a classification network pre-trained on a large dataset. However, it is relatively difficult to sufficiently fine-tune the low-level layers of the network due to the gradient vanishing problem. In this work, we propose a novel fine-tuning strategy that allows low-level layers to be sufficiently trained by rolling back the weights of high-level layers to their initial pre-trained weights. Our strategy alleviates the problem of gradient vanishing in low-level layers and robustly trains the low-level layers to fit the ReID dataset, thereby increasing the performance of ReID tasks. The improved performance of the proposed strategy is validated via several experiments. Furthermore, without any add-ons such as pose estimation or segmentation, our strategy exhibits state-of-the-art performance using only a vanilla deep convolutional neural network architecture.
Submitted 18 January, 2019;
originally announced January 2019.
-
A Proof of the Beierle-Kranz-Leander Conjecture related to Lightweight Multiplication in $\mathds{F}_{2^n}$
Authors:
Sihem Mesnager,
Kwang Ho Kim,
Dujin Jo,
Junyop Choe,
Munhyon Han,
Dok Nam Lee
Abstract:
Lightweight cryptography is a key tool for building strong security solutions for pervasive devices with limited resources. Due to the stringent cost constraints inherent in extremely large applications (ranging from RFIDs and smart cards to mobile devices), the efficient implementation of cryptographic hardware and software algorithms is of utmost importance to realize the vision of generalized computing.
In CRYPTO 2016, Beierle, Kranz and Leander have considered lightweight multiplication in $\mathds{F}_{2^n}$. Specifically, they have considered the fundamental question of optimizing finite field multiplications with one fixed element and investigated which field representation, that is which choice of basis, allows for an optimal implementation. They have left open a conjecture related to two XOR-count. Using the theory of linear algebra, we prove in the present paper that their conjecture is correct. Consequently, this proved conjecture can be used as a reference for further developing and implementing cryptography algorithms in lightweight devices.
Submitted 23 December, 2018;
originally announced December 2018.
-
Generating a Fusion Image: One's Identity and Another's Shape
Authors:
Donggyu Joo,
Doyeon Kim,
Junmo Kim
Abstract:
Generating a novel image by manipulating two input images is an interesting research problem in the study of generative adversarial networks (GANs). We propose a new GAN-based network that generates a fusion image with the identity of input image x and the shape of input image y. Our network can simultaneously train on more than two image datasets in an unsupervised manner. We define an identity loss LI to catch the identity of image x and a shape loss LS to get the shape of y. In addition, we propose a novel training method called Min-Patch training to focus the generator on crucial parts of an image, rather than its entirety. We show qualitative results on the VGG Youtube Pose dataset, Eye dataset (MPIIGaze and UnityEyes), and the Photo-Sketch-Cartoon dataset.
Submitted 25 January, 2022; v1 submitted 20 April, 2018;
originally announced April 2018.
-
Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras
Authors:
Keunwoo Choi,
Deokjin Joo,
Juho Kim
Abstract:
We introduce Kapre, Keras layers for audio and music signal preprocessing. Music research using deep neural networks requires a heavy and tedious preprocessing stage, for which audio processing parameters are often ignored in parameter optimisation. To solve this problem, Kapre implements time-frequency conversions, normalisation, and data augmentation as Keras layers. We report simple benchmark results, showing real-time on-GPU preprocessing adds a reasonable amount of computation.
Submitted 19 June, 2017;
originally announced June 2017.
-
Automatic Content-aware Projection for 360° Videos
Authors:
Yeong Won Kim,
Dae-Yong Jo,
Chang-Ryeol Lee,
Hyeok-Jae Choi,
Yong Hoon Kwon,
Kuk-Jin Yoon
Abstract:
To watch 360° videos on normal 2D displays, we need to project the selected part of the 360° image onto the 2D display plane. In this paper, we propose a fully-automated framework for generating content-aware 2D normal-view perspective videos from 360° videos. In particular, we focus on the projection step, preserving important image contents and reducing image distortion. Our projection method is based on the Pannini projection model. First, salient contents such as linear structures and salient regions in the image are preserved by optimizing a single Pannini projection model. Then, multiple Pannini projection models at salient regions are interpolated to suppress image distortion globally. Finally, temporal consistency of the image projection is enforced to produce temporally stable normal-view videos. Our proposed projection method does not require any user interaction and is much faster than previous content-preserving methods. It can be applied not only to images but also to videos, taking the temporal consistency of projection into account. Experiments on various 360° videos show the superiority of the proposed projection method quantitatively and qualitatively.
Submitted 10 September, 2017; v1 submitted 24 April, 2017;
originally announced April 2017.