Automatic Creative Selection with Cross-Modal Matching

Authors: Alex Kim, Jia Huang, Rob Monarch, Jerry Kwac, Anikesh Kamath, Parmeshwar Khurd, Kailash Thiyagarajan, Goodman Gu

Abstract: Application developers advertise their Apps by creating product pages with App images, and bidding on search terms. It is then crucial for App images to be highly relevant to the search terms. Solutions to this problem require an image-text matching model to predict the quality of the match between the chosen image and the search terms. In this work, we present a novel approach to matching an App image to search terms based on fine-tuning a pre-trained LXMERT model. We show that compared to the CLIP model and a baseline using a Transformer model for search terms, and a ResNet model for images, we significantly improve the matching accuracy. We evaluate our approach using two sets of labels: advertiser associated (image, search term) pairs for a given application, and human ratings for the relevance between (image, search term) pairs. Our approach achieves 0.96 AUC score for advertiser associated ground truth, outperforming the transformer+ResNet baseline and the fine-tuned CLIP model by 8% and 14%, respectively. For human labeled ground truth, our approach achieves 0.95 AUC score, outperforming the transformer+ResNet baseline and the fine-tuned CLIP model by 16% and 17%, respectively.

Submitted 28 February, 2024; originally announced May 2024.
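
The AUC scores quoted in the abstract above measure how often a matching model ranks a relevant (image, search term) pair above an irrelevant one. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name `auc_score` and the toy labels and scores are assumptions made for illustration.

```python
def auc_score(labels, scores):
    """Pairwise AUC: the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical relevance labels (1 = relevant pair) and model match scores.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auc_score(toy_labels, toy_scores)
```

In this toy case three of the four positive/negative pairs are ordered correctly, so the score is 0.75. The quadratic loop is chosen for readability; a rank-based formulation would be preferable on large label sets.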

arXiv:2403.07240 [pdf, other]

Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Learning

Authors: Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, Yunchao Wei

Abstract: This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images despite limited training data. Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries. However, the rapid advancements in synthesis technology have led to specific artifacts for each generation model. Consequently, these detectors have exhibited a lack of proficiency in learning the frequency domain and tend to overfit to the artifacts present in the training data, leading to suboptimal performance on unseen sources. To address this issue, we introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors. Our method forces the detector to continuously focus on high-frequency information, exploiting high-frequency representation of features across spatial and channel dimensions. Additionally, we incorporate a straightforward frequency domain learning module to learn source-agnostic features. It involves convolutional layers applied to both the phase spectrum and amplitude spectrum between the Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (iFFT). Extensive experimentation involving 17 GANs demonstrates the effectiveness of our proposed method, showcasing state-of-the-art performance (+9.8%) while requiring fewer parameters. The code is available at https://github.com/chuangchuangtan/FreqNet-DeepfakeDetection.

Submitted 11 March, 2024; originally announced March 2024.

Comments: 9 pages, 4 figures, AAAI24
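
The FreqNet abstract above describes a module that operates on the amplitude and phase spectra between an FFT and an inverse FFT. The sketch below illustrates only that data flow on a 1-D signal, under loud assumptions: a naive DFT stands in for the FFT, plain Python callables (`amplitude_op`, `phase_op`) stand in for the learned convolutional layers, and all names are hypothetical. It is not the FreqNet implementation, which is available at the repository linked in the abstract.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for a short example."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        out.append(total / n if inverse else total)
    return out


def frequency_module(signal, amplitude_op, phase_op):
    """Transform -> split into amplitude and phase -> apply one operation to
    each spectrum -> recombine -> inverse transform."""
    spectrum = dft(signal)
    amplitude = [abs(z) for z in spectrum]
    phase = [cmath.phase(z) for z in spectrum]
    new_amplitude = amplitude_op(amplitude)  # placeholder for a learned layer
    new_phase = phase_op(phase)              # placeholder for a learned layer
    recombined = [cmath.rect(a, p) for a, p in zip(new_amplitude, new_phase)]
    return [z.real for z in dft(recombined, inverse=True)]


def identity(values):
    return list(values)


def double(values):
    return [2.0 * v for v in values]


example_signal = [1.0, 2.0, 0.5, -1.0]
unchanged = frequency_module(example_signal, identity, identity)
amplified = frequency_module(example_signal, double, identity)
```

With identity operations the signal is recovered up to floating-point error, and doubling every amplitude doubles the signal, which confirms that the split and recombination steps are consistent.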

arXiv:2402.10251 [pdf, other]

Brant-2: Foundation Model for Brain Signals

Authors: Zhizhang Yuan, Daoze Zhang, Junru Chen, Gefei Gu, Yang Yang

Abstract: Foundational models benefit from pre-training on large amounts of unlabeled data and enable strong performance in a wide variety of applications with a small amount of labeled data. Such models can be particularly effective in analyzing brain signals, as this field encompasses numerous application scenarios, and it is costly to perform large-scale annotation. In this work, we present the largest foundation model in brain signals, Brant-2. Compared to Brant, a foundation model designed for intracranial neural signals, Brant-2 not only exhibits robustness towards data variations and modeling scales but also can be applied to a broader range of brain neural data. By experimenting on an extensive range of tasks, we demonstrate that Brant-2 is adaptive to various application scenarios in brain signals. Further analyses reveal the scalability of Brant-2, validate each component's effectiveness, and showcase our model's ability to maintain performance in scenarios with scarce labels.

Submitted 28 March, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: 14 pages, 7 figures

arXiv:2312.10461 [pdf, other]

Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection

Authors: Chuangchuang Tan, Huan Liu, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, Yunchao Wei

Abstract: Recently, the proliferation of highly realistic synthetic images, facilitated through a variety of GANs and Diffusions, has significantly heightened the susceptibility to misuse. While the primary focus of deepfake detection has traditionally centered on the design of detection algorithms, an investigative inquiry into the generator architectures has remained conspicuously absent in recent years. This paper contributes to this lacuna by rethinking the architectures of CNN-based generators, thereby establishing a generalized representation of synthetic artifacts. Our findings illuminate that the up-sampling operator can, beyond frequency-based artifacts, produce generalized forgery artifacts. In particular, the local interdependence among image pixels caused by upsampling operators is significantly demonstrated in synthetic images generated by GAN or diffusion. Building upon this observation, we introduce the concept of Neighboring Pixel Relationships (NPR) as a means to capture and characterize the generalized structural artifacts stemming from up-sampling operations. A comprehensive analysis is conducted on an open-world dataset, comprising samples generated by 28 distinct generative models. This analysis culminates in the establishment of a novel state-of-the-art performance, showcasing a remarkable 11.6% improvement over existing methods. The code is available at https://github.com/chuangchuangtan/NPR-DeepfakeDetection.

Submitted 20 December, 2023; v1 submitted 16 December, 2023; originally announced December 2023.

Comments: 10 pages, 4 figures

arXiv:2312.01998 [pdf, other]

Language-only Efficient Training of Zero-shot Composed Image Retrieval

Authors: Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun

Abstract: Composed image retrieval (CIR) task takes a composed query of image and text, aiming to search relative images for both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image, query text, and target image, which is very expensive to collect. Several recent works have worked on the zero-shot (ZS) CIR paradigm to tackle the issue without using pre-collected triplets. However, the existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework, only using language for its training. Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text. Then, we let the new and original texts have the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective; LinCIR with CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performances on four different CIR benchmarks, CIRCO, GeneCIS, FashionIQ, and CIRR, even outperforming supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir

Submitted 31 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: CVPR 2024 camera-ready; First two authors contributed equally; 17 pages, 3.1MB

arXiv:2312.01725 [pdf, other]

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

Authors: Jeongho Kim, Gyojung Gu, Minho Park, Sunghyun Park, Jaegul Choo

Abstract: Given a clothing image and a person image, an image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing image. In this work, we aim to expand the applicability of the pre-trained diffusion model so that it can be utilized independently for the virtual try-on task. The main challenge is to preserve the clothing details while effectively utilizing the robust generative capability of the pre-trained model. In order to tackle these issues, we propose StableVITON, learning the semantic correspondence between the clothing and the human body within the latent space of the pre-trained diffusion model in an end-to-end manner. Our proposed zero cross-attention blocks not only preserve the clothing details by learning the semantic correspondence but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process. Through our proposed novel attention total variation loss and applying augmentation, we achieve the sharp attention map, resulting in a more precise representation of clothing details. StableVITON outperforms the baselines in qualitative and quantitative evaluation, showing promising quality in arbitrary person images. Our code is available at https://github.com/rlawjdghek/StableVITON.

Submitted 4 December, 2023; originally announced December 2023.

Comments: 17 pages

arXiv:2311.14246 [pdf, other]

Constant-Time Wasmtime, for Real This Time: End-to-End Verified Zero-Overhead Constant-Time Programming for the Web and Beyond

Authors: Garrett Gu, Hovav Shacham

Abstract: We claim that existing techniques and tools for generating and verifying constant-time code are incomplete, since they rely on assumptions that compiler optimization passes do not break constant-timeness or that certain operations execute in constant time on the hardware. We present the first end-to-end constant-time-aware compilation process that preserves constant-time semantics at every step from a high-level language down to microarchitectural guarantees, provided by the forthcoming ARM PSTATE.DIT feature. First, we present a new compiler-verifier suite based on the JIT-style runtime Wasmtime, modified to compile ct-wasm, a preexisting type-safe constant-time extension of WebAssembly, into ARM machine code while maintaining the constant-time property throughout all optimization passes. The resulting machine code is then fed into an automated verifier that requires no human intervention and uses static dataflow analysis in Ghidra to check the constant-timeness of the output. Our verifier leverages characteristics unique to ct-wasm-generated code in order to speed up verification while preserving both soundness and wide applicability. We also consider the resistance of our compilation and verification against speculative timing leakages such as Spectre. Finally, in order to expose ct-Wasmtime at a high level, we present a port of FaCT, a preexisting constant-time-aware DSL, to target ct-wasm.

Submitted 23 November, 2023; originally announced November 2023.

arXiv:2311.02396 [pdf, other]

Precise Robotic Needle-Threading with Tactile Perception and Reinforcement Learning

Authors: Zhenjun Yu, Wenqiang Xu, Siqiong Yao, Jieji Ren, Tutian Tang, Yutong Li, Guoying Gu, Cewu Lu

Abstract: This work presents a novel tactile perception-based method, named T-NT, for performing the needle-threading task, an application of deformable linear object (DLO) manipulation. This task is divided into two main stages: Tail-end Finding and Tail-end Insertion. In the first stage, the agent traces the contour of the thread twice using vision-based tactile sensors mounted on the gripper fingers. The two-run tracing is to locate the tail-end of the thread. In the second stage, it employs a tactile-guided reinforcement learning (RL) model to drive the robot to insert the thread into the target needle eyelet. The RL model is trained in a Unity-based simulated environment. The simulation environment supports tactile rendering which can produce realistic tactile images and thread modeling. During insertion, the position of the poke point and the center of the eyelet are obtained through a pre-trained segmentation model, Grounded-SAM, which predicts the masks for both the needle eye and thread imprints. These positions are then fed into the reinforcement learning model, aiding in a smoother transition to real-world applications. Extensive experiments on real robots are conducted to demonstrate the efficacy of our method. More experiments and videos can be found in the supplementary materials and on the website: https://sites.google.com/view/tac-needlethreading.

Submitted 4 November, 2023; originally announced November 2023.

arXiv:2310.01740 [pdf, other]

Control of Soft Pneumatic Actuators with Approximated Dynamical Modeling

Authors: Wu-Te Yang, Burak Kurkcu, Motohiro Hirao, Lingfeng Sun, Xinghao Zhu, Zhizhou Zhang, Grace X. Gu, Masayoshi Tomizuka

Abstract: This paper introduces a full system modeling strategy for a syringe pump and soft pneumatic actuators (SPAs). The soft actuator is conceptualized as a beam structure, utilizing a second-order bending model. The equation of natural frequency is derived from Euler's bending theory, while the damping ratio is estimated by fitting step responses of soft pneumatic actuators. Evaluation of model uncertainty underscores the robustness of our modeling methodology. To validate our approach, we deploy it across four prototypes varying in dimensional parameters. Furthermore, a syringe pump is designed to drive the actuator, and a pressure model is proposed to construct a full system model. By employing this full system model, the Linear-Quadratic Regulator (LQR) controller is implemented to control the soft actuator, achieving high-speed responses and high accuracy in both step response and square wave function response tests. Both the modeling method and the LQR controller are thoroughly evaluated through experiments. Lastly, a gripper, consisting of two actuators with a feedback controller, demonstrates stable grasping of delicate objects with a significantly higher success rate.

Submitted 19 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: 8 pages, 10 figures, accepted by 2023 IEEE ROBIO conference

arXiv:2303.11916 [pdf, other]

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Authors: Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun

Abstract: This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million reference images, conditions, and corresponding target image triplets to train CIR models. CompoDiff and SynthTriplets18M tackle the shortages of the previous CIR approaches, such as poor generalizability due to the small dataset scale and the limited types of conditions. CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, but also enables a more versatile and controllable CIR by accepting various conditions, such as negative text, and image mask conditions. CompoDiff also shows the controllability of the condition strength between text and image queries and the trade-off between inference speed and performance, which are unavailable with existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff

Submitted 16 July, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: TMLR camera-ready; First two authors contributed equally; TMLR Expert Certification; 30 pages, 5.9MB

arXiv:2212.04114 [pdf, other]

Group Generalized Mean Pooling for Vision Transformer

Authors: Byungsoo Ko, Han-Gyu Kim, Byeongho Heo, Sangdoo Yun, Sanghyuk Chun, Geonmo Gu, Wonjae Kim

Abstract: Vision Transformer (ViT) extracts the final representation from either class token or an average of all patch tokens, following the architecture of Transformer in Natural Language Processing (NLP) or Convolutional Neural Networks (CNNs) in computer vision. However, studies for the best way of aggregating the patch tokens are still limited to average pooling, while widely-used pooling strategies, such as max and GeM pooling, can be considered. Despite their effectiveness, the existing pooling strategies do not consider the architecture of ViT and the channel-wise difference in the activation maps, aggregating the crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. As ViT groups the channels via a multi-head attention mechanism, grouping the channels by GGeM leads to lower head-wise dependence while amplifying important channels on the activation maps. Exploiting GGeM shows 0.1%p to 0.7%p performance boosts compared to the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models in ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating the superiority of GGeM for a variety of tasks. GGeM is a simple algorithm in that only a few lines of code are necessary for implementation.

Submitted 8 December, 2022; originally announced December 2022.
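
The GGeM abstract above states that channels are divided into groups and that GeM pooling is computed with one shared pooling parameter per group. The sketch below is one plausible reading of that sentence, not the authors' code: the function name `ggem_pool`, the clamping of activations to a small positive `eps`, and the assumption of contiguous, equal-sized channel groups are all illustrative choices that the abstract does not specify.

```python
def ggem_pool(tokens, group_powers, eps=1e-6):
    """Pool patch tokens channel by channel with generalized-mean pooling,
    using one exponent p per channel group.

    tokens: list of patch tokens, each a list of C channel activations.
    group_powers: list of G positive exponents; C must be divisible by G.
    """
    num_channels = len(tokens[0])
    num_groups = len(group_powers)
    if num_channels % num_groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    group_size = num_channels // num_groups
    pooled = []
    for c in range(num_channels):
        p = group_powers[c // group_size]  # exponent shared within the group
        # Assumption: clamp to eps so fractional powers stay well defined.
        mean_pow = sum(max(t[c], eps) ** p for t in tokens) / len(tokens)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


# Two hypothetical patch tokens with four channels, split into two groups.
toy_tokens = [[1.0, 3.0, 1.0, 3.0], [3.0, 1.0, 3.0, 1.0]]
toy_pooled = ggem_pool(toy_tokens, group_powers=[1.0, 2.0])
```

With p = 1 a group reduces to average pooling, and as p grows the result approaches max pooling, so a per-group exponent lets different channel groups sit at different points between the two.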

arXiv:2210.02254 [pdf, other]

Granularity-aware Adaptation for Image Retrieval over Multiple Tasks

Authors: Jon Almazán, Byungsoo Ko, Geonmo Gu, Diane Larlus, Yannis Kalantidis

Abstract: Strong image search models can be learned for a specific domain, i.e., a set of labels, provided that some labeled images of that domain are available. A practical visual search model, however, should be versatile enough to solve multiple retrieval tasks simultaneously, even if those cover very different specialized domains. Additionally, it should be able to benefit from even unlabeled images from these various retrieval tasks. This is the more practical scenario that we consider in this paper. We address it with the proposed Grappa, an approach that starts from a strong pretrained model, and adapts it to tackle multiple retrieval tasks concurrently, using only unlabeled images from the different task domains. We extend the pretrained model with multiple independently trained sets of adaptors that use pseudo-label sets of different sizes, effectively mimicking different pseudo-granularities. We reconcile all adaptor sets into a single unified model suited for all retrieval tasks by learning fusion layers that we guide by propagating pseudo-granularity attentions across neighbors in the feature space. Results on a benchmark composed of six heterogeneous retrieval tasks show that the unsupervised Grappa model improves the zero-shot performance of a state-of-the-art self-supervised learning model, and in some places reaches or improves over a task label-aware oracle that selects the most fitting pseudo-granularity per task.

Submitted 5 October, 2022; originally announced October 2022.

Comments: ECCV 2022

arXiv:2206.14180 [pdf, other]

High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions

Authors: Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, Jaegul Choo

Abstract: Image-based virtual try-on aims to synthesize an image of a person wearing a given clothing item. To solve the task, the existing methods warp the clothing item to fit the person's body and generate the segmentation map of the person wearing the item before fusing the item with the person. However, when the warping and the segmentation generation stages operate individually without information exchange, the misalignment between the warped clothes and the segmentation map occurs, which leads to the artifacts in the final image. The information disconnection also causes excessive warping near the clothing regions occluded by the body parts, so-called pixel-squeezing artifacts. To settle the issues, we propose a novel try-on condition generator as a unified module of the two stages (i.e., warping and segmentation generation stages). A newly proposed feature fusion block in the condition generator implements the information exchange, and the condition generator does not create any misalignment or pixel-squeezing artifacts. We also introduce discriminator rejection that filters out the incorrect segmentation map predictions and assures the performance of virtual try-on frameworks. Experiments on a high-resolution dataset demonstrate that our model successfully handles the misalignment and occlusion, and significantly outperforms the baselines. Code is available at https://github.com/sangyun884/HR-VITON.

Submitted 20 July, 2022; v1 submitted 28 June, 2022; originally announced June 2022.

Comments: Accepted to ECCV 2022

arXiv:2206.08585 [pdf, other]

HairFIT: Pose-Invariant Hairstyle Transfer via Flow-based Hair Alignment and Semantic-Region-Aware Inpainting

Authors: Chaeyeon Chung, Taewoo Kim, Hyelin Nam, Seunghwan Choi, Gyojung Gu, Sunghyun Park, Jaegul Choo

Abstract: Hairstyle transfer is the task of modifying a source hairstyle to a target one. Although recent hairstyle transfer models can reflect the delicate features of hairstyles, they still have two major limitations. First, the existing methods fail to transfer hairstyles when a source and a target image have different poses (e.g., viewing direction or face size), which is prevalent in the real world. Also, the previous models generate unrealistic images when there is a non-trivial amount of regions in the source image occluded by its original hair. When modifying long hair to short hair, shoulders or backgrounds occluded by the long hair need to be inpainted. To address these issues, we propose a novel framework for pose-invariant hairstyle transfer, HairFIT. Our model consists of two stages: 1) flow-based hair alignment and 2) hair synthesis. In the hair alignment stage, we leverage a keypoint-based optical flow estimator to align a target hairstyle with a source pose. Then, we generate a final hairstyle-transferred image in the hair synthesis stage based on Semantic-region-aware Inpainting Mask (SIM) estimator. Our SIM estimator divides the occluded regions in the source image into different semantic regions to reflect their distinct features during the inpainting. To demonstrate the effectiveness of our model, we conduct quantitative and qualitative evaluations using multi-view datasets, K-hairstyle and VoxCeleb. The results indicate that HairFIT achieves a state-of-the-art performance by successfully transferring hairstyles between images of different poses, which has never been achieved before.

Submitted 17 June, 2022; originally announced June 2022.

Comments: BMVC 2021 Oral Presentation

arXiv:2203.14463 [pdf, other]

Large-scale Bilingual Language-Image Contrastive Learning

Authors: Byungsoo Ko, Geonmo Gu

Abstract: This paper is a technical report to share our experience and findings building a Korean and English bilingual multimodal model. While many of the multimodal datasets focus on English and multilingual multimodal research uses machine-translated texts, employing such machine-translated texts is limited to describing unique expressions, cultural information, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with such training schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) training multimodal model without cross-lingual relation can learn the relation via visual semantics; 3) our bilingual KELIP can capture cultural differences of visual semantics for the same meaning of words; 4) a large-scale multimodal model can be used for multimodal feature analogy. We hope that this work will provide helpful experience and findings for future research. We provide an open-source pre-trained KELIP.

Submitted 14 April, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

Comments: Accepted by ICLRW2022

arXiv:2112.08816 [pdf, other]

Deep Hash Distillation for Image Retrieval

Authors: Young Kyun Jang, Geonmo Gu, Byungsoo Ko, Isaac Kang, Nam Ik Cho

Abstract: In hash-based image retrieval systems, degraded or transformed inputs usually generate different codes from the original, deteriorating the retrieval accuracy. To mitigate this issue, data augmentation can be applied during training. However, even if augmented samples of an image are similar in real feature space, the quantization can scatter them far away in Hamming space. This results in representation discrepancies that can impede training and degrade performance. In this work, we propose a novel self-distilled hashing scheme to minimize the discrepancy while exploiting the potential of augmented data. By transferring the hash knowledge of the weakly-transformed samples to the strong ones, we make the hash code insensitive to various transformations. We also introduce hash proxy-based similarity learning and binary cross entropy-based quantization loss to provide fine quality hash codes. Ultimately, we construct a deep hashing framework that not only improves the existing deep hashing approaches, but also achieves the state-of-the-art retrieval results. Extensive experiments are conducted and confirm the effectiveness of our work.

Submitted 13 July, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: ECCV2022

arXiv:2106.00186 [pdf, other]

Towards Light-weight and Real-time Line Segment Detection

Authors: Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, Minchul Shin

Abstract: Previous deep learning-based line segment detection (LSD) suffers from the immense model size and high computational cost for line prediction. This constrains them from real-time inference on computationally restricted environments. In this paper, we propose a real-time and light-weight line segment detector for resource-constrained environments named Mobile LSD (M-LSD). We design an extremely efficient LSD architecture by minimizing the backbone network and removing the typical multi-module process for line prediction found in previous methods. To maintain competitive performance with a light-weight network, we present novel training schemes: Segments of Line segment (SoL) augmentation, matching and geometric loss. SoL augmentation splits a line segment into multiple subparts, which are used to provide auxiliary line data during the training process. Moreover, the matching and geometric loss allow a model to capture additional geometric cues. Compared with TP-LSD-Lite, previously the best real-time LSD method, our model (M-LSD-tiny) achieves competitive performance with 2.5% of model size and an increase of 130.5% in inference speed on GPU. Furthermore, our model runs at 56.8 FPS and 48.6 FPS on the latest Android and iPhone mobile devices, respectively. To the best of our knowledge, this is the first real-time deep LSD available on mobile devices. Our code is available.

Submitted 26 April, 2022; v1 submitted 31 May, 2021; originally announced June 2021.

Comments: Accepted by AAAI2022

arXiv:2104.03015 [pdf, other]

RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network

Authors: Minchul Shin, Yoonjae Cho, Byungsoo Ko, Geonmo Gu

Abstract: In this paper, we study the compositional learning of images and texts for image retrieval. The query is given in the form of an image and text that describes the desired modifications to the image; the goal is to retrieve the target image that satisfies the given modifications and resembles the query by composing information in both the text and image modalities. To remedy this, we propose a novel architecture designed for the image-text composition task and show that the proposed structure can effectively encode the differences between the source and target images conditioned on the text. Furthermore, we introduce a new joint training technique based on the graph convolutional network that is generally applicable for any existing composition methods in a plug-and-play manner. We found that the proposed technique consistently improves performance and achieves state-of-the-art scores on various benchmarks. To avoid misleading experimental results caused by trivial training hyper-parameters, we reproduce all individual baselines and train models with a unified training environment. We expect this approach to suppress undesirable effects from irrelevant components and emphasize the image-text composition module's ability. Also, we achieve the state-of-the-art score without restricting the training environment, which implies the superiority of our method considering the gains from hyper-parameter tuning. The code, including all the baseline methods, is released at https://github.com/nashory/rtic-gcn-pytorch.

Submitted 25 October, 2021; v1 submitted 7 April, 2021; originally announced April 2021.

arXiv:2103.16940 [pdf, other]

Learning with Memory-based Virtual Classes for Deep Metric Learning

Authors: Byungsoo Ko, Geonmo Gu, Han-Gyu Kim

Abstract: The core of deep metric learning (DML) involves learning visual similarities in high-dimensional embedding space. One of the main challenges is to generalize from seen classes of training data to unseen classes of test data. Recent works have focused on exploiting past embeddings to increase the number of instances for the seen classes. Such methods achieve performance improvement via augmentation, while the strong focus on seen classes still remains. This can be undesirable for DML, where training and test data exhibit entirely different classes. In this work, we present a novel training strategy for DML called MemVir. Unlike previous works, MemVir memorizes both embedding features and class weights to utilize them as additional virtual classes. The exploitation of virtual classes not only utilizes augmented information for training but also alleviates a strong focus on seen classes for better generalization. Moreover, we embed the idea of curriculum learning by slowly adding virtual classes for a gradual increase in learning difficulty, which improves the learning stability as well as the final performance. MemVir can be easily applied to many existing loss functions without any modification. Extensive experimental results on famous benchmarks demonstrate the superiority of MemVir over state-of-the-art competitors. Code of MemVir is publicly available.

Submitted 8 October, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

Comments: Accepted by ICCV2021

arXiv:2103.15454 [pdf, other]

Proxy Synthesis: Learning with Synthetic Classes for Deep Metric Learning

Authors: Geonmo Gu, Byungsoo Ko, Han-Gyu Kim

Abstract: One of the main purposes of deep metric learning is to construct an embedding space that has well-generalized embeddings on both seen (training) classes and unseen (test) classes. Most existing works have tried to achieve this using different types of metric objectives and hard sample mining strategies with given training data. However, learning with only the training data can be overfitted to the seen classes, leading to the lack of generalization capability on unseen classes. To address this problem, we propose a simple regularizer called Proxy Synthesis that exploits synthetic classes for stronger generalization in deep metric learning. The proposed method generates synthetic embeddings and proxies that work as synthetic classes, and they mimic unseen classes when computing proxy-based losses. Proxy Synthesis derives an embedding space considering class relations and smooth decision boundaries for robustness on unseen classes. Our method is applicable to any proxy-based losses, including softmax and its variants. Extensive experiments on four famous benchmarks in image retrieval tasks demonstrate that Proxy Synthesis significantly boosts the performance of proxy-based losses and achieves state-of-the-art performance.

Submitted 29 March, 2021; originally announced March 2021.

Comments: Accepted by AAAI2021

arXiv:2103.11526 [pdf, other]

ExAD: An Ensemble Approach for Explanation-based Adversarial Detection

Authors: Raj Vardhan, Ninghao Liu, Phakpoom Chinprutthiwong, Weijie Fu, Zhenyu Hu, Xia Ben Hu, Guofei Gu

Abstract: Recent research has shown Deep Neural Networks (DNNs) to be vulnerable to adversarial examples that induce desired misclassifications in the models. Such risks impede the application of machine learning in security-sensitive domains. Several defense methods have been proposed against adversarial attacks to detect adversarial examples at test time or to make machine learning models more robust. However, while existing methods are quite effective under blackbox threat model, where the attacker is not aware of the defense, they are relatively ineffective under whitebox threat model, where the attacker has full knowledge of the defense. In this paper, we propose ExAD, a framework to detect adversarial examples using an ensemble of explanation techniques. Each explanation technique in ExAD produces an explanation map identifying the relevance of input variables for the model's classification. For every class in a dataset, the system includes a detector network, corresponding to each explanation technique, which is trained to distinguish between normal and abnormal explanation maps. At test time, if the explanation map of an input is detected as abnormal by any detector model of the classified class, then we consider the input to be an adversarial example. We evaluate our approach using six state-of-the-art adversarial attacks on three image datasets. Our extensive evaluation shows that our mechanism can effectively detect these attacks under blackbox threat model with limited false-positives. Furthermore, we find that our approach achieves promising results in limiting the success rate of whitebox attacks.

Submitted 21 March, 2021; originally announced March 2021.

Comments: 15 pages, 10 figures

arXiv:2102.06288 [pdf, other]

doi 10.1109/ICIP42928.2021.9506557

K-Hairstyle: A Large-scale Korean Hairstyle Dataset for Virtual Hair Editing and Hairstyle Classification

Authors: Taewoo Kim, Chaeyeon Chung, Sunghyun Park, Gyojung Gu, Keonmin Nam, Wonzo Choe, Jaesung Lee, Jaegul Choo

Abstract: The hair and beauty industry is a fast-growing industry. This led to the development of various applications, such as virtual hair dyeing or hairstyle transfer, to satisfy the customer's needs. Although several hairstyle datasets are available for these applications, they often consist of a relatively small number of images with low resolution, thus limiting their performance on high-quality hair editing. In response, we introduce a novel large-scale Korean hairstyle dataset, K-hairstyle, containing 500,000 high-resolution images. In addition, K-hairstyle includes various hair attributes annotated by Korean expert hairstylists as well as hair segmentation masks. We validate the effectiveness of our dataset via several applications, such as hair dyeing, hairstyle transfer, and hairstyle classification. K-hairstyle is publicly available at https://psh01087.github.io/K-Hairstyle/.

Submitted 9 October, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

Comments: ICIP 2021 final version

arXiv:2101.04773 [pdf, other]

Practical Speech Re-use Prevention in Voice-driven Services

Authors: Yangyong Zhang, Maliheh Shirvanian, Sunpreet S. Arora, Jianwei Huang, Guofei Gu

Abstract: Voice-driven services (VDS) are being used in a variety of applications ranging from smart home control to payments using digital assistants. The input to such services is often captured via an open voice channel, e.g., using a microphone, in an unsupervised setting. One of the key operational security requirements in such setting is the freshness of the input speech. We present AEOLUS, a security overlay that proactively embeds a dynamic acoustic nonce at the time of user interaction, and detects the presence of the embedded nonce in the recorded speech to ensure freshness. We demonstrate that acoustic nonce can (i) be reliably embedded and retrieved, and (ii) be non-disruptive (and even imperceptible) to a VDS user. Optimal parameters (acoustic nonce's operating frequency, amplitude, and bitrate) are determined for (i) and (ii) from a practical perspective. Experimental results show that AEOLUS yields 0.5% FRR at 0% FAR for speech re-use prevention up to a distance of 4 meters in three real-world environments with different background noise levels. We also conduct a user study with 120 participants, which shows that the acoustic nonce does not degrade overall user experience for 94.16% of speech samples, on average, in these environments. AEOLUS can therefore be used in practice to prevent speech re-use and ensure the freshness of speech input.

Submitted 12 January, 2021; originally announced January 2021.

arXiv:2012.03283 [pdf, other]

On the Privacy and Integrity Risks of Contact-Tracing Applications

Authors: Jianwei Huang, Vinod Yegneswaran, Phillip Porras, Guofei Gu

Abstract: Smartphone-based contact-tracing applications are at the epicenter of the global fight against the Covid-19 pandemic. While governments and healthcare agencies are eager to mandate the deployment of such applications en-masse, they face increasing scrutiny from the popular press, security companies, and human rights watch agencies that fear the exploitation of these technologies as surveillance tools. Finding the optimal balance between community safety and privacy has been a challenge, and strategies to address these concerns have varied among countries. This paper describes two important attacks that affect a broad swath of contact-tracing applications. The first, referred to as contact-isolation attack, is a user-privacy attack that can be used to identify potentially infected patients in your neighborhood. The second is a contact-pollution attack that affects the integrity of contact tracing applications by causing them to produce a high volume of false-positive alerts. We developed prototype implementations and evaluated both attacks in the context of the DP-3T application framework, but these vulnerabilities affect a much broader class of applications. We found that both attacks are feasible and realizable with a minimal attacker work factor. We further conducted an impact assessment of these attacks by using a simulation study and measurements from the SafeGraph database. Our results indicate that attacks launched from a modest number (on the order of 10,000) of monitoring points can effectively decloak between 5-40% of infected users in a major metropolis, such as Houston.

Submitted 8 December, 2020; v1 submitted 6 December, 2020; originally announced December 2020.

arXiv:2011.12492 [pdf]

Multi-feature driven active contour segmentation model for infrared image with intensity inhomogeneity

Authors: Qinyan Huang, Weiwen Zhou, Minjie Wan, Xin Chen, Qian Chen, Guohua Gu

Abstract: Infrared (IR) image segmentation is essential in many urban defence applications, such as pedestrian surveillance, vehicle counting, security monitoring, etc. Active contour model (ACM) is one of the most widely used image segmentation tools at present, but the existing methods only utilize the local or global single feature information of image to minimize the energy function, which is easy to cause false segmentations in IR images. In this paper, we propose a multi-feature driven active contour segmentation model to handle IR images with intensity inhomogeneity. Firstly, an especially-designed signed pressure force (SPF) function is constructed by combining the global information calculated by global average gray information and the local multi-feature information calculated by local entropy, local standard deviation and gradient information. Then, we draw upon adaptive weight coefficient calculated by local range to adjust the afore-mentioned global term and local term. Next, the SPF function is substituted into the level set formulation (LSF) for further evolution. Finally, the LSF converges after a finite number of iterations, and the IR image segmentation result is obtained from the corresponding convergence result. Experimental results demonstrate that the presented method outperforms the state-of-the-art models in terms of precision rate and overlapping rate in IR test images.

Submitted 24 November, 2020; originally announced November 2020.

arXiv:2010.11724 [pdf, other]

LID 2020: The Learning from Imperfect Data Challenge Results

Authors: Yunchao Wei, Shuai Zheng, Ming-Ming Cheng, Hang Zhao, Liwei Wang, Errui Ding, Yi Yang, Antonio Torralba, Ting Liu, Guolei Sun, Wenguan Wang, Luc Van Gool, Wonho Bae, Junhyug Noh, Jinhwan Seo, Gunhee Kim, Hao Zhao, Ming Lu, Anbang Yao, Yiwen Guo, Yurong Chen, Li Zhang, Chuangchuang Tan, Tao Ruan, Guanghua Gu , et al. (10 additional authors not shown)

Abstract: Learning from imperfect data becomes an issue in many industrial applications after the research community has made profound progress in supervised learning from perfectly annotated datasets. The purpose of the Learning from Imperfect Data (LID) workshop is to inspire and facilitate the research in developing novel approaches that would harness the imperfect data and improve the data-efficiency during training. A massive amount of user-generated data is nowadays available on multiple internet services. How to leverage those and improve the machine learning models is a high impact problem. We organize the challenges in conjunction with the workshop. The goal of these challenges is to find the state-of-the-art approaches in the weakly supervised learning setting for object detection, semantic segmentation, and scene parsing. There are three tracks in the challenge, i.e., weakly supervised semantic segmentation (Track 1), weakly supervised scene parsing (Track 2), and weakly supervised object localization (Track 3). In Track 1, based on ILSVRC DET, we provide pixel-level annotations of 15K images from 200 categories for evaluation. In Track 2, we provide point-based annotations for the training set of ADE20K. In Track 3, based on ILSVRC CLS-LOC, we provide pixel-level annotations of 44,271 images for evaluation. Besides, we further introduce a new evaluation metric proposed by \cite{zhang2020rethinking}, i.e., IoU curve, to measure the quality of the generated object localization maps. This technical report summarizes the highlights from the challenge. The challenge submission server and the leaderboard will remain open to interested researchers. More details regarding the challenge and the benchmarks are available at https://lidchallenge.github.io

Submitted 17 October, 2020; originally announced October 2020.

Comments: Summary of the 2nd Learning from Imperfect Data Workshop in conjunction with CVPR 2020

arXiv:2010.05260 [pdf]

Infrared target tracking based on proximal robust principal component analysis method

Authors: Chao Ma, Guohua Gu, Xin Miao, Minjie Wan, Weixian Qian, Kan Ren, Qian Chen

Abstract: Infrared target tracking plays an important role in both civil and military fields. The main challenges in designing a robust and high-precision tracker for infrared sequences include overlap, occlusion and appearance change. To this end, this paper proposes an infrared target tracker based on proximal robust principal component analysis method. Firstly, the observation matrix is decomposed into a sparse occlusion matrix and a low-rank target matrix, and the constraint optimization is carried out with an approaching proximal norm which is better than L1-norm. To solve this convex optimization problem, Alternating Direction Method of Multipliers (ADMM) is employed to estimate the variables alternately. Finally, the framework of particle filter with model update strategy is exploited to locate the target. Through a series of experiments on real infrared target sequences, the effectiveness and robustness of our algorithm are proved.

Submitted 11 October, 2020; originally announced October 2020.

arXiv:2007.14995 [pdf, other]

Return-Oriented Programming in RISC-V

Authors: Garrett Gu, Hovav Shacham

Abstract: RISC-V is an open-source hardware ISA based on the RISC design principles, and has been the subject of some novel ROP mitigation technique proposals due to its open-source nature. However, very little work has actually evaluated whether such an attack is feasible assuming a typical RISC-V implementation. We show that RISC-V ROP can be used to perform Turing complete calculation and arbitrary function calls by leveraging gadgets found in a version of the GNU libc library. Using techniques such as self-modifying ROP chains and algorithmic ROP chain generation, we demonstrate the power of RISC-V ROP by creating a compiler that converts code of arbitrary complexity written in a popular Turing-complete language into RISC-V ROP chains.

Submitted 29 July, 2020; originally announced July 2020.

arXiv:2003.02546 [pdf, other]

Embedding Expansion: Augmentation in Embedding Space for Deep Metric Learning

Authors: Byungsoo Ko, Geonmo Gu

Abstract: Learning the distance metric between pairs of samples has been studied for image retrieval and clustering. With the remarkable success of pair-based metric learning losses, recent works have proposed the use of generated synthetic points on metric learning losses for augmentation and generalization. However, these methods require additional generative networks along with the main network, which can lead to a larger model size, slower training speed, and harder optimization. Meanwhile, post-processing techniques, such as query expansion and database augmentation, have proposed the combination of feature points to obtain additional semantic information. In this paper, inspired by query expansion and database augmentation, we propose an augmentation method in an embedding space for pair-based metric learning losses, called embedding expansion. The proposed method generates synthetic points containing augmented information by a combination of feature points and performs hard negative pair mining to learn with the most informative feature representations. Because of its simplicity and flexibility, it can be used for existing metric learning losses without affecting model size, training speed, or optimization difficulty. Finally, the combination of embedding expansion and representative metric learning losses outperforms the state-of-the-art losses and previous sample generation methods in both image retrieval and clustering tasks. The implementation is publicly available.

Submitted 23 April, 2020; v1 submitted 5 March, 2020; originally announced March 2020.

Comments: Accepted by CVPR 2020
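
Illustrative sketch (not taken from the paper): the abstract above says synthetic points are generated "by a combination of feature points" but does not give the formula. The snippet below assumes one plausible reading, evenly spaced linear interpolation between two same-class embeddings followed by L2 re-normalization. The function names `l2_normalize` and `expand_pair` and the parameter `n_points` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def expand_pair(a, b, n_points=2):
    """Return n_points synthetic embeddings on the segment between a and b.

    Assumption: linear interpolation at evenly spaced internal points,
    re-normalized so the synthetic points lie on the unit sphere.
    """
    synthetic = []
    for k in range(1, n_points + 1):
        t = k / (n_points + 1)  # interpolation weight, strictly inside (0, 1)
        mixed = [(1 - t) * x + t * y for x, y in zip(a, b)]
        synthetic.append(l2_normalize(mixed))
    return synthetic


if __name__ == "__main__":
    # Two unit-length embeddings assumed to share a class label.
    for point in expand_pair([1.0, 0.0], [0.0, 1.0], n_points=2):
        print(point)
```

In a pair-based loss, such synthetic points would be candidates for the hard negative pair mining step the abstract mentions; that step is not shown here.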

arXiv:2001.11658 [pdf, other]

Symmetrical Synthesis for Deep Metric Learning

Authors: Geonmo Gu, Byungsoo Ko

Abstract: Deep metric learning aims to learn embeddings that contain semantic similarity information among data points. To learn better embeddings, methods to generate synthetic hard samples have been proposed. Existing methods of synthetic hard sample generation are adopting autoencoders or generative adversarial networks, but this leads to more hyper-parameters, harder optimization, and slower training speed. In this paper, we address these problems by proposing a novel method of synthetic hard sample generation called symmetrical synthesis. Given two original feature points from the same class, the proposed method firstly generates synthetic points with each other as an axis of symmetry. Secondly, it performs hard negative pair mining within the original and synthetic points to select a more informative negative pair for computing the metric learning loss. Our proposed method is hyper-parameter free and plug-and-play for existing metric learning losses without network modification. We demonstrate the superiority of our proposed method over existing methods for a variety of loss functions on clustering and image retrieval tasks. Our implementation is publicly available.

Submitted 23 April, 2020; v1 submitted 30 January, 2020; originally announced January 2020.

Comments: Accepted by AAAI 2020
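
Illustrative sketch (not taken from the paper): the abstract above describes generating synthetic points "with each other as an axis of symmetry". One geometric reading, assumed here, is to reflect each embedding across the line through the origin spanned by the other embedding. The function name `reflect_across` is hypothetical, and the mining step is omitted.

```python
import math


def reflect_across(point, axis):
    """Reflect `point` across the line through the origin spanned by `axis`.

    Uses x' = 2 * (x . u) * u - x with u = axis / ||axis||.
    Reflection preserves the norm of `point`.
    """
    axis_norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / axis_norm for a in axis]
    dot = sum(p * u for p, u in zip(point, unit))
    return [2.0 * dot * u - p for p, u in zip(point, unit)]


if __name__ == "__main__":
    x_k = [1.0, 0.0]
    x_l = [1.0, 1.0]
    # Each point of the same-class pair is mirrored using the other as the axis.
    print(reflect_across(x_k, x_l))
    print(reflect_across(x_l, x_k))
```

Because the construction has no tunable parameter, it is consistent with the abstract's "hyper-parameter free" claim, though the exact formulation in the paper may differ.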

arXiv:2001.06268 [pdf, ps, other]

Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network

Authors: Jungkyu Lee, Taeryun Won, Tae Kwan Lee, Hyemin Lee, Geonmo Gu, Kiho Hong

Abstract: Recent studies in image classification have demonstrated a variety of techniques for improving the performance of Convolutional Neural Networks (CNNs). However, attempts to combine existing techniques to create a practical model are still uncommon. In this study, we carry out extensive experiments to validate that carefully assembling these techniques and applying them to basic CNN models (e.g. ResNet and MobileNet) can improve the accuracy and robustness of the models while minimizing the loss of throughput. Our proposed assembled ResNet-50 shows improvements in top-1 accuracy from 76.3% to 82.78%, mCE from 76.0% to 48.9% and mFR from 57.7% to 32.3% on ILSVRC2012 validation set. With these improvements, inference throughput only decreases from 536 to 312. To verify the performance improvement in transfer learning, fine grained classification and image retrieval tasks were tested on several public datasets and showed that the improvement to backbone network performance boosted transfer learning performance significantly. Our approach achieved 1st place in the iFood Competition Fine-Grained Visual Recognition at CVPR 2019, and the source code and trained models are available at https://github.com/clovaai/assembled-cnn

Submitted 13 March, 2020; v1 submitted 17 January, 2020; originally announced January 2020.

Comments: 9 pages, 2 figures, 18 tables

arXiv:1911.01644 [pdf, other]

Fast Multiple Pattern Cartesian Tree Matching

Authors: Geonmo Gu, Siwoo Song, Simone Faro, Thierry Lecroq, Kunsoo Park

Abstract: Cartesian tree matching is the problem of finding all substrings in a given text which have the same Cartesian trees as that of a given pattern. In this paper, we deal with Cartesian tree matching for the case of multiple patterns. We present two fingerprinting methods, i.e., the parent-distance encoding and the binary encoding. By combining an efficient fingerprinting method and a conventional multiple string matching algorithm, we can efficiently solve multiple pattern Cartesian tree matching. We propose three practical algorithms for multiple pattern Cartesian tree matching based on the Wu-Manber algorithm, the Rabin-Karp algorithm, and the Alpha Skip Search algorithm, respectively. In the experiments we compare our solutions against the previous algorithm [18]. Our solutions run faster than the previous algorithm as the pattern lengths increase. Especially, our algorithm based on Wu-Manber runs up to 33 times faster.

Submitted 5 November, 2019; originally announced November 2019.

Comments: Submitted to WALCOM 2020
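
As an illustration of the problem statement in the abstract above (not of the paper's fingerprinting algorithms), here is a minimal single-pattern sketch in Python. It builds each window's Cartesian tree with the standard stack construction and compares parent arrays, which costs O(nm) time overall. The function names are hypothetical, and the tie-breaking rule (the leftmost minimum becomes the root) is an assumption, since the abstract does not specify one.

```python
def cartesian_tree_parents(seq):
    """Return the parent index of every position in the min-rooted
    Cartesian tree of seq (-1 marks the root).
    Assumption: among equal values, the leftmost one is closer to the root."""
    parent = [-1] * len(seq)
    stack = []  # indices on the rightmost path, values non-decreasing
    for i, value in enumerate(seq):
        last_popped = -1
        # Pop strictly larger values; they end up in the left subtree of i.
        while stack and seq[stack[-1]] > value:
            last_popped = stack.pop()
        if last_popped != -1:
            parent[last_popped] = i
        if stack:
            parent[i] = stack[-1]
        stack.append(i)
    return parent


def naive_cartesian_tree_match(text, pattern):
    """Return all start positions whose window has the same Cartesian
    tree as pattern. Naive baseline: rebuilds the tree for every window."""
    m = len(pattern)
    target = cartesian_tree_parents(pattern)
    return [
        start
        for start in range(len(text) - m + 1)
        if cartesian_tree_parents(text[start:start + m]) == target
    ]


if __name__ == "__main__":
    print(naive_cartesian_tree_match([5, 2, 4, 1, 3, 6, 0], [3, 1, 2]))
```

Comparing parent arrays is sufficient here because the in-order traversal of a Cartesian tree is the index order, so the parent array fixes the tree shape.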

arXiv:1907.11854 [pdf, other]

A Benchmark on Tricks for Large-scale Image Retrieval

Authors: Byungsoo Ko, Minchul Shin, Geonmo Gu, HeeJae Jun, Tae Kwan Lee, Youngjoon Kim

Abstract: Many studies have been performed on metric learning, which has become a key ingredient in top-performing methods of instance-level image retrieval. Meanwhile, less attention has been paid to pre-processing and post-processing tricks that can significantly boost performance. Furthermore, we found that most previous studies used small-scale datasets to simplify processing. Because the behavior of a feature representation in a deep learning model depends on both domain and data, it is important to understand how models behave in large-scale environments when a proper combination of retrieval tricks is used. In this paper, we extensively analyze the effect of well-known pre-processing and post-processing tricks, and their combination, for large-scale image retrieval. We found that proper use of these tricks can significantly improve model performance without necessitating complex architecture or introducing loss, as confirmed by achieving a competitive result on the Google Landmark Retrieval Challenge 2019.

Submitted 23 April, 2020; v1 submitted 27 July, 2019; originally announced July 2019.

arXiv:1810.05399 [pdf]

Thermal Infrared Colorization via Conditional Generative Adversarial Network

Authors: Xiaodong Kuang, Xiubao Sui, Chengwei Liu, Yuan Liu, Qian Chen, Guohua Gu

Abstract: Transforming a thermal infrared image into a realistic RGB image is a challenging task. In this paper we propose a deep learning method to bridge this gap. We propose learning the transformation mapping using a coarse-to-fine generator that preserves the details. Since the standard mean squared loss cannot penalize the distance between colorized and ground truth images well, we propose a composite loss function that combines content, adversarial, perceptual and total variation losses. The content loss is used to recover global image information while the latter three losses are used to synthesize local realistic textures. Quantitative and qualitative experiments demonstrate that our approach significantly outperforms existing approaches.

Submitted 4 November, 2018; v1 submitted 12 October, 2018; originally announced October 2018.

arXiv:1808.05336 [pdf, other]

Simultaneous Localization And Mapping with depth Prediction using Capsule Networks for UAVs

Authors: Sunil Prakash, Gaelan Gu

Abstract: In this paper, we propose a novel implementation of a simultaneous localization and mapping (SLAM) system based on a monocular camera from an unmanned aerial vehicle (UAV) using depth prediction performed with Capsule Networks (CapsNet), which possess improvements over the drawbacks of the more widely-used Convolutional Neural Networks (CNN). An Extended Kalman Filter will assist in estimating the position of the UAV so that we are able to update the belief for the environment. Results will be evaluated on a benchmark dataset to portray the accuracy of our intended approach.

Submitted 15 August, 2018; originally announced August 2018.

arXiv:1804.09021 [pdf, other]

Label-aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition

Authors: Zhenghui Wang, Yanru Qu, Liheng Chen, Jian Shen, Weinan Zhang, Shaodian Zhang, Yimei Gao, Gen Gu, Ken Chen, Yong Yu

Abstract: We study the problem of named entity recognition (NER) from electronic medical records, which is one of the most fundamental and critical problems for medical text mining. Medical records which are written by clinicians from different specialties usually contain quite different terminologies and writing styles. The difference of specialties and the cost of human annotation make it particularly difficult to train a universal medical NER system. In this paper, we propose a label-aware double transfer learning framework (La-DTL) for cross-specialty NER, so that a medical NER system designed for one specialty could be conveniently applied to another one with minimal annotation efforts. The transferability is guaranteed by two components: (i) we propose label-aware MMD for feature representation transfer, and (ii) we perform parameter transfer with a theoretical upper bound which is also label aware. We conduct extensive experiments on 12 cross-specialty NER tasks. The experimental results demonstrate that La-DTL provides consistent accuracy improvement over strong baselines. Besides, the promising experimental results on non-medical NER scenarios indicate that La-DTL has the potential to be seamlessly adapted to a wide range of NER tasks.

Submitted 28 April, 2018; v1 submitted 24 April, 2018; originally announced April 2018.

Comments: NAACL HLT 2018

arXiv:1712.03534 [pdf, other]

Dynamics Transfer GAN: Generating Video by Transferring Arbitrary Temporal Dynamics from a Source Video to a Single Target Image

Authors: Wissam J. Baddar, Geonmo Gu, Sangmin Lee, Yong Man Ro

Abstract: In this paper, we propose Dynamics Transfer GAN, a new method for generating video sequences based on generative adversarial learning. The spatial constructs of a generated video sequence are acquired from the target image. The dynamics of the generated video sequence are imported from a source video sequence, with arbitrary motion, and imposed onto the target image. To preserve the spatial construct of the target image, the appearance of the source video sequence is suppressed and only the dynamics are obtained before being imposed onto the target image. That is achieved using the proposed appearance suppressed dynamics feature. Moreover, the spatial and temporal consistencies of the generated video sequence are verified via two discriminator networks. One discriminator validates the fidelity of the generated frames' appearance, while the other validates the dynamic consistency of the generated video sequence. Experiments have been conducted to verify the quality of the video sequences generated by the proposed method. The results verified that Dynamics Transfer GAN successfully transferred arbitrary dynamics of the source video sequence onto a target image when generating the output video sequence. The experimental results also showed that Dynamics Transfer GAN maintained the spatial constructs (appearance) of the target image while generating spatially and temporally consistent video sequences.

Submitted 10 December, 2017; originally announced December 2017.

arXiv:1711.10267 [pdf, other]

Differential Generative Adversarial Networks: Synthesizing Non-linear Facial Variations with Limited Number of Training Data

Authors: Geonmo Gu, Seong Tae Kim, Kihyun Kim, Wissam J. Baddar, Yong Man Ro

Abstract: In face-related applications with a publicly available dataset, synthesizing non-linear facial variations (e.g., facial expression, head-pose, illumination, etc.) through a generative model is helpful in addressing the lack of training data. In reality, however, there is insufficient data to even train the generative model for face synthesis. In this paper, we propose Differential Generative Adversarial Networks (D-GAN) that can perform photo-realistic face synthesis even when training data is small. Two discriminators are devised to ensure that the generator approximates a face manifold, which can express face changes as it wants. Experimental results demonstrate that the proposed method is robust to the amount of training data and that the synthesized images are useful for improving the performance of a face expression classifier.

Submitted 28 December, 2017; v1 submitted 28 November, 2017; originally announced November 2017.

Comments: 20 pages

arXiv:1709.05961 [pdf, other]

Adaptive compressed 3D imaging based on wavelet trees and Hadamard multiplexing with a single photon counting detector

Authors: Huidong Dai, Weiji He, Guohua Gu, Ling Ye, Tianyi Mao, Qian Chen

Abstract: Photon counting 3D imaging makes it possible to obtain 3D images with single-photon sensitivity and sub-ns temporal resolution. However, it is challenging to scale to high spatial resolution. In this work, we demonstrate a photon counting 3D imaging technique with short-pulsed structured illumination and a single-pixel photon counting detector. The proposed multi-resolution photon counting 3D imaging technique acquires a high-resolution 3D image from a coarse image and edges at successively finer resolution sampled by Hadamard multiplexing along the wavelet trees. The detected power is significantly increased thanks to the Hadamard multiplexing. Both the required measurements and the reconstruction time can be significantly reduced by performing wavelet-tree-based prediction of edge regions and Hadamard demultiplexing, which makes the proposed technique suitable for scenes with high spatial resolution. The experimental results indicate that a 3D image at resolution up to 512*512 pixels can be acquired and retrieved with practical time as low as 17 seconds.

Submitted 14 September, 2017; originally announced September 2017.

Comments: 11 pages, 5 figures, 1 table

arXiv:1506.05203 [pdf, other]

Fast Multiple Order-Preserving Matching Algorithms

Authors: Myoungji Han, Munseong Kang, Sukhyeun Cho, Geonmo Gu, Jeong Seop Sim, Kunsoo Park

Abstract: Given a text $T$ and a pattern $P$, the order-preserving matching problem is to find all substrings in $T$ which have the same relative orders as $P$. Order-preserving matching has been an active research area since it was introduced by Kubica et al. \cite{kubica2013linear} and Kim et al. \cite{kim2014order}. In this paper we present two algorithms for the multiple order-preserving matching problem, one of which runs in sublinear time on average and the other in linear time on average. Both algorithms run much faster than the previous algorithms.

Submitted 17 June, 2015; originally announced June 2015.

Comments: 15 pages, 8 figures, submitted to IWOCA 2015
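
To make the problem definition in the abstract above concrete, here is a minimal single-pattern sketch in Python. It is a naive baseline that re-ranks every window, not one of the paper's sublinear or linear average-time algorithms. The function names are hypothetical, and treating equal values as sharing a rank is an assumption, since the abstract does not say how ties are handled.

```python
def rank_signature(seq):
    """Map each value to its rank among the distinct values of seq.
    Two sequences have the same relative order exactly when their
    signatures are equal (assumption: equal values share a rank)."""
    rank_of = {value: rank for rank, value in enumerate(sorted(set(seq)))}
    return tuple(rank_of[value] for value in seq)


def naive_order_preserving_match(text, pattern):
    """Return all start positions in text whose window of len(pattern)
    values has the same relative order as pattern."""
    m = len(pattern)
    target = rank_signature(pattern)
    return [
        start
        for start in range(len(text) - m + 1)
        if rank_signature(text[start:start + m]) == target
    ]


if __name__ == "__main__":
    text = [10, 22, 15, 30, 20, 18, 27]
    pattern = [1, 3, 2]  # shape: lowest, highest, middle
    print(naive_order_preserving_match(text, pattern))
```

In the usage example, the windows `[10, 22, 15]` and `[15, 30, 20]` share the pattern's low, high, middle shape, so positions 0 and 2 are reported.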

Showing 1–40 of 40 results for author: Gu, G