Bridging Optimal Transport and Jacobian Regularization by Optimal Trajectory for Enhanced Adversarial Defense

Authors: Binh M. Le, Shahroz Tariq, Simon S. Woo

Abstract: Deep neural networks, particularly in vision tasks, are notably susceptible to adversarial perturbations. To overcome this challenge, developing a robust classifier is crucial. In light of the recent advancements in the robustness of classifiers, we delve deep into the intricacies of adversarial training and Jacobian regularization, two pivotal defenses. Our work is the first to carefully analyze and characterize these two schools of approaches, both theoretically and empirically, to demonstrate how each approach impacts the robust learning of a classifier. Next, we propose our novel Optimal Transport with Jacobian regularization method, dubbed OTJR, bridging the input Jacobian regularization with the output representation alignment by leveraging the optimal transport theory. In particular, we employ the Sliced Wasserstein (SW) distance that can efficiently push the adversarial samples' representations closer to those of clean samples, regardless of the number of classes within the dataset. The SW distance provides the adversarial samples' movement directions, which are much more informative and powerful for the Jacobian regularization. Our empirical evaluations set a new standard in the domain, with our method achieving commendable accuracies of 52.57% on CIFAR-10 and 28.3% on CIFAR-100 datasets under the AutoAttack. Further validating our model's practicality, we conducted real-world tests by subjecting internet-sourced images to online adversarial attacks. These demonstrations highlight our model's capability to counteract sophisticated adversarial perturbations, affirming its significance and applicability in real-world scenarios.

Submitted 12 February, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

arXiv:2303.09779 [pdf, other]

Bidirectional Domain Mixup for Domain Adaptive Semantic Segmentation

Authors: Daehan Kim, Minseok Seo, Kwanyong Park, Inkyu Shin, Sanghyun Woo, In-So Kweon, Dong-Geol Choi

Abstract: Mixup provides interpolated training samples and allows the model to obtain smoother decision boundaries for better generalization. The idea can be naturally applied to the domain adaptation task, where we can mix the source and target samples to obtain domain-mixed samples for better adaptation. However, the extension of the idea from classification to segmentation (i.e., structured output) is nontrivial. This paper systematically studies the impact of mixup under the domain adaptive semantic segmentation task and presents a simple yet effective mixup strategy called Bidirectional Domain Mixup (BDM). Specifically, we achieve domain mixup in two steps: cut and paste. Given the warm-up model trained from any adaptation techniques, we forward the source and target samples and perform a simple threshold-based cut out of the unconfident regions (cut). We then fill in the dropped regions with the other domain region patches (paste). In doing so, we jointly consider class distribution, spatial structure, and pseudo label confidence. Based on our analysis, we found that BDM leaves domain transferable regions by cutting and balances the dataset-level class distribution while preserving natural scene context by pasting. We coupled our proposal with various state-of-the-art adaptation models and observed consistent, significant improvement. We also provide extensive ablation experiments to empirically verify our main components of the framework. Visit our project page with the code at https://sites.google.com/view/bidirectional-domain-mixup

Submitted 17 March, 2023; originally announced March 2023.

Comments: 10 pages, 3 figures, Accepted at AAAI 2023

arXiv:2302.13156 [pdf, other]

doi 10.1145/3595353.3595882

Why Do Facial Deepfake Detectors Fail?

Authors: Binh Le, Shahroz Tariq, Alsharif Abuadbba, Kristen Moore, Simon Woo

Abstract: Recent rapid advancements in deepfake technology have allowed the creation of highly realistic fake media, such as video, image, and audio. These materials pose significant challenges to human authentication, such as impersonation, misinformation, or even a threat to national security. To keep pace with these rapid advancements, several deepfake detection algorithms have been proposed, leading to an ongoing arms race between deepfake creators and deepfake detectors. Nevertheless, these detectors are often unreliable and frequently fail to detect deepfakes. This study highlights the challenges they face in detecting deepfakes, including (1) the pre-processing pipeline of artifacts and (2) the fact that generators of new, unseen deepfake samples have not been considered when building the defense models. Our work sheds light on the need for further research and development in this field to create more robust and reliable detectors.

Submitted 10 September, 2023; v1 submitted 25 February, 2023; originally announced February 2023.

Comments: 5 pages, ACM ASIACCS 2023

arXiv:2301.04333 [pdf, other]

Learnable Path in Neural Controlled Differential Equations

Authors: Sheo Yon Jhin, Minju Jo, Seungji Kook, Noseong Park, Sungpil Woo, Sunhwan Lim

Abstract: Neural controlled differential equations (NCDEs), which are continuous analogues to recurrent neural networks (RNNs), are a specialized model in (irregular) time-series processing. In comparison with similar models, e.g., neural ordinary differential equations (NODEs), the key distinctive characteristics of NCDEs are i) the adoption of the continuous path created by an interpolation algorithm from each raw discrete time-series sample and ii) the adoption of the Riemann–Stieltjes integral. It is the continuous path that makes NCDEs continuous analogues of RNNs. However, NCDEs use existing interpolation algorithms to create the path, and it is unclear whether these algorithms can create an optimal path. To this end, we present a method to generate another latent path (rather than relying on existing interpolation algorithms), which is identical to learning an appropriate interpolation method. We design an encoder-decoder module based on NCDEs and NODEs, and a special training method for it. Our method shows the best performance in both time-series classification and forecasting.

Submitted 11 January, 2023; originally announced January 2023.

Comments: Accepted by AAAI 2023

arXiv:2301.00808 [pdf, other]

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

Authors: Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie

Abstract: Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.

Submitted 2 January, 2023; originally announced January 2023.

Comments: Code and models available at https://github.com/facebookresearch/ConvNeXt-V2

arXiv:2212.11895 [pdf, other]

doi 10.1103/PhysRevB.107.155429

Excitonic Absorption Signatures of Twisted Bilayer WSe$_{2}$ by Electron Energy-Loss Spectroscopy

Authors: Steffi Y. Woo, Alberto Zobelli, Robert Schneider, Ashish Arora, Johann A. Preuß, Benjamin J. Carey, Steffen Michaelis de Vasconcellos, Maurizia Palummo, Rudolf Bratschitsch, Luiz H. G. Tizei

Abstract: Moiré twist angle underpins the interlayer interaction of excitons in twisted van der Waals hetero- and homo-structures. The influence of twist angle on the excitonic absorption of twisted bilayer tungsten diselenide (WSe$_{2}$) has been investigated using electron energy-loss spectroscopy. Atomic-resolution imaging by scanning transmission electron microscopy was used to determine key structural parameters, including the nanoscale measurement of the relative twist angle and stacking order. Detailed spectral analysis revealed a pronounced blueshift in the high-energy excitonic peak C with increasing twist angle, up to 200 meV when compared to the AA$^{\prime}$ stacking. The experimental findings have been discussed relative to first-principle calculations of the dielectric response of the AA$^{\prime}$ stacked bilayer WSe$_{2}$ as compared to monolayer WSe$_{2}$ by employing the \textit{GW} plus Bethe-Salpeter equation (BSE) approaches, resolving the origin of higher energy spectral features from ensembles of excitonic transitions, and thus any discrepancies between previous calculations. Furthermore, the electronic structure of moiré supercells spanning twist angles of $\sim$9.5-46.5$^{\circ}$ calculated by density functional theory (DFT) was unfolded, showing an uplifting of the conduction band minimum near the $Q$ point and minimal change in the upper valence band concurrently. The combined experiment/theory investigation provides valuable insight into the physical origins of high-energy absorption resonances in twisted bilayers, which makes it possible to track the evolution of interlayer coupling from tuning of the exciton C transitions by absorption spectroscopy.

Submitted 22 December, 2022; originally announced December 2022.

arXiv:2212.10149 [pdf, other]

Tracking by Associating Clips

Authors: Sanghyun Woo, Kwanyong Park, Seoung Wug Oh, In So Kweon, Joon-Young Lee

Abstract: The tracking-by-detection paradigm today has become the dominant method for multi-object tracking and works by detecting objects in each frame and then performing data association across frames. However, its sequential frame-wise matching property fundamentally suffers from the intermediate interruptions in a video, such as object occlusions, fast camera movements, and abrupt light changes. Moreover, it typically overlooks temporal information beyond the two frames for matching. In this paper, we investigate an alternative by treating object association as clip-wise matching. Our new perspective views a single long video sequence as multiple short clips, and then the tracking is performed both within and between the clips. The benefits of this new approach are twofold. First, our method is robust to tracking error accumulation or propagation, as the video chunking allows bypassing the interrupted frames, and the short clip tracking avoids the conventional error-prone long-term track memory management. Second, the multiple frame information is aggregated during the clip-wise matching, resulting in a more accurate long-range track association than the current frame-wise matching. Given the state-of-the-art tracking-by-detection tracker, QDTrack, we showcase how the tracking performance improves with our new tracking formulation. We evaluate our proposals on two tracking benchmarks, TAO and MOT17, whose characteristics and challenges complement each other.

Submitted 20 December, 2022; originally announced December 2022.

Comments: ECCV 2022

arXiv:2212.10147 [pdf, other]

Bridging Images and Videos: A Simple Learning Framework for Large Vocabulary Video Object Detection

Authors: Sanghyun Woo, Kwanyong Park, Seoung Wug Oh, In So Kweon, Joon-Young Lee

Abstract: Scaling object taxonomies is one of the important steps toward a robust real-world deployment of recognition systems. We have witnessed remarkable progress in images since the introduction of the LVIS benchmark. To continue this success in videos, a new video benchmark, TAO, was recently presented. Given the recent encouraging results from both detection and tracking communities, we are interested in marrying those two advances and building a strong large vocabulary video tracker. However, supervisions in LVIS and TAO are inherently sparse or even missing, posing two new challenges for training the large vocabulary trackers. First, LVIS contains no tracking supervision, which leads to inconsistent learning of detection (with LVIS and TAO) and tracking (only with TAO). Second, the detection supervisions in TAO are partial, which results in catastrophic forgetting of absent LVIS categories during video fine-tuning. To resolve these challenges, we present a simple but effective learning framework that takes full advantage of all available training data to learn detection and tracking without losing the ability to recognize any LVIS categories. With this new learning scheme, we show that consistent improvements of various large vocabulary trackers are achievable, setting strong baseline results on the challenging TAO benchmarks.

Submitted 20 December, 2022; originally announced December 2022.

Comments: ECCV 2022

arXiv:2212.08356 [pdf, other]

Test-time Adaptation in the Dynamic World with Compound Domain Knowledge Management

Authors: Junha Song, Kwanyong Park, InKyu Shin, Sanghyun Woo, Chaoning Zhang, In So Kweon

Abstract: Prior to the deployment of robotic systems, pre-training the deep-recognition models on all potential visual cases is infeasible in practice. Hence, test-time adaptation (TTA) allows the model to adapt itself to novel environments and improve its performance during test time (i.e., lifelong adaptation). Several works for TTA have shown promising adaptation performances in continuously changing environments. However, our investigation reveals that existing methods are vulnerable to dynamic distributional changes and often lead to overfitting of TTA models. To address this problem, this paper first presents a robust TTA framework with compound domain knowledge management. Our framework helps the TTA model to harvest the knowledge of multiple representative domains (i.e., compound domain) and conduct the TTA based on the compound domain knowledge. In addition, to prevent overfitting of the TTA model, we devise a novel regularization which modulates the adaptation rates using domain-similarity between the source and the current target domain. With the synergy of the proposed framework and regularization, we achieve consistent performance improvements in diverse TTA scenarios, especially on dynamic domain shifts. We demonstrate the generality of our proposals via extensive experiments including image classification on ImageNet-C and semantic segmentation on GTA5, C-driving, and corrupted Cityscapes datasets.

Submitted 15 April, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

Comments: 8 pages

arXiv:2212.08355 [pdf, other]

Learning Classifiers of Prototypes and Reciprocal Points for Universal Domain Adaptation

Authors: Sungsu Hur, Inkyu Shin, Kwanyong Park, Sanghyun Woo, In So Kweon

Abstract: Universal Domain Adaptation aims to transfer the knowledge between the datasets by handling two shifts: domain-shift and category-shift. The main challenge is correctly distinguishing the unknown target samples while adapting the distribution of known class knowledge from source to target. Most existing methods approach this problem by first training the target adapted known classifier and then relying on the single threshold to distinguish unknown target samples. However, this simple threshold-based approach prevents the model from considering the underlying complexities existing between the known and unknown samples in the high-dimensional feature space. In this paper, we propose a new approach in which we use two sets of feature points, namely dual Classifiers for Prototypes and Reciprocals (CPR). Our key idea is to associate each prototype with corresponding known class features while pushing the reciprocals apart from these prototypes to locate them in the potential unknown feature space. The target samples are then classified as unknown if they fall near any reciprocals at test time. To successfully train our framework, we collect the partial, confident target samples that are classified as known or unknown through our proposed multi-criteria selection. We then additionally apply the entropy loss regularization to them. For further adaptation, we also apply standard consistency regularization that matches the predictions of two different views of the input to make the target feature space more compact. We evaluate our proposal, CPR, on three standard benchmarks and achieve comparable or new state-of-the-art results. We also provide extensive ablation experiments to verify our main design choices in our framework.

Submitted 16 December, 2022; originally announced December 2022.

Comments: Accepted at WACV 2023

arXiv:2212.04761 [pdf, other]

Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition

Authors: Jungho Lee, Minhyeok Lee, Suhwan Cho, Sungmin Woo, Sungjun Jang, Sangyoun Lee

Abstract: Skeleton-based action recognition has attracted considerable attention due to its compact representation of the human body's skeletal structure. Many recent methods have achieved remarkable performance using graph convolutional networks (GCNs) and convolutional neural networks (CNNs), which extract spatial and temporal features, respectively. Although spatial and temporal dependencies in the human skeleton have been explored separately, spatio-temporal dependency is rarely considered. In this paper, we propose the Spatio-Temporal Curve Network (STC-Net) to effectively leverage the spatio-temporal dependency of the human skeleton. Our proposed network consists of two novel elements: 1) The Spatio-Temporal Curve (STC) module; and 2) Dilated Kernels for Graph Convolution (DK-GC). The STC module dynamically adjusts the receptive field by identifying meaningful node connections between every adjacent frame and generating spatio-temporal curves based on the identified node connections, providing an adaptive spatio-temporal coverage. In addition, we propose DK-GC to consider long-range dependencies, which results in a large receptive field without any additional parameters by applying an extended kernel to the given adjacency matrices of the graph. Our STC-Net combines these two modules and achieves state-of-the-art performance on four skeleton-based action recognition benchmarks.

Submitted 18 July, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

Comments: Accepted by ICCV 2023

arXiv:2212.04548 [pdf, other]

STLGRU: Spatio-Temporal Lightweight Graph GRU for Traffic Flow Prediction

Authors: Kishor Kumar Bhaumik, Fahim Faisal Niloy, Saif Mahmud, Simon Woo

Abstract: Reliable forecasting of traffic flow requires efficient modeling of traffic data. Indeed, different correlations and influences arise in a dynamic traffic network, making modeling a complicated task. Existing literature has proposed many different methods to capture traffic networks' complex underlying spatial-temporal relations. However, given the heterogeneity of traffic data, consistently capturing both spatial and temporal dependencies presents a significant challenge. Also, as more and more sophisticated methods are being proposed, models are increasingly becoming memory-heavy and, thus, unsuitable for low-powered devices. To this end, we propose Spatio-Temporal Lightweight Graph GRU, namely STLGRU, a novel traffic forecasting model for predicting traffic flow accurately. Specifically, our proposed STLGRU can effectively capture dynamic local and global spatial-temporal relations of traffic networks using memory-augmented attention and gating mechanisms in a continuously synchronized manner. Moreover, instead of employing separate temporal and spatial components, we show that our memory module and gated unit can successfully learn the spatial-temporal dependencies with reduced memory usage and fewer parameters. Extensive experimental results on three real-world public traffic datasets demonstrate that our method can not only achieve state-of-the-art performance but also exhibit competitive computational efficiency. Our code is available at https://github.com/Kishor-Bhaumik/STLGRU

Submitted 19 February, 2024; v1 submitted 8 December, 2022; originally announced December 2022.

Comments: PAKDD 2024 (Oral)

arXiv:2211.15926 [pdf, other]

Interpretations Cannot Be Trusted: Stealthy and Effective Adversarial Perturbations against Interpretable Deep Learning

Authors: Eldor Abdukhamidov, Mohammed Abuhamad, Simon S. Woo, Eric Chan-Tin, Tamer Abuhmed

Abstract: Deep learning methods have gained increased attention in various applications due to their outstanding performance. For exploring how this high performance relates to the proper use of data artifacts and the accurate problem formulation of a given task, interpretation models have become a crucial component in developing deep learning-based systems. Interpretation models enable the understanding of the inner workings of deep learning models and offer a sense of security in detecting the misuse of artifacts in the input data. Similar to prediction models, interpretation models are also susceptible to adversarial inputs. This work introduces two attacks, AdvEdge and AdvEdge$^{+}$, that deceive both the target deep learning model and the coupled interpretation model. We assess the effectiveness of proposed attacks against two deep learning model architectures coupled with four interpretation models that represent different categories of interpretation models. Our experiments include the attack implementation using various attack frameworks. We also explore the potential countermeasures against such attacks. Our analysis shows the effectiveness of our attacks in terms of deceiving the deep learning models and their interpreters, and highlights insights to improve and circumvent the attacks.

Submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.13916 [pdf, other]

Towards Good Practices for Missing Modality Robust Action Recognition

Authors: Sangmin Woo, Sumin Lee, Yeonju Park, Muhammad Adi Nugroho, Changick Kim

Abstract: Standard multi-modal models assume the use of the same modalities in training and inference stages. However, in practice, the environment in which multi-modal models operate may not satisfy such an assumption. As such, their performances degrade drastically if any modality is missing in the inference stage. We ask: how can we train a model that is robust to missing modalities? This paper seeks a set of good practices for multi-modal action recognition, with a particular interest in circumstances where some modalities are not available at inference time. First, we study how to effectively regularize the model during training (e.g., data augmentation). Second, we investigate fusion methods for robustness to missing modalities: we find that transformer-based fusion shows better robustness to missing modalities than summation or concatenation. Third, we propose a simple modular network, ActionMAE, which learns missing modality predictive coding by randomly dropping modality features and trying to reconstruct them with the remaining modality features. Coupling these good practices, we build a model that is not only effective in multi-modal action recognition but also robust to missing modalities. Our model achieves state-of-the-art results on multiple benchmarks and maintains competitive performance even in missing modality scenarios. Codes are available at https://github.com/sangminwoo/ActionMAE.

Submitted 30 March, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

Comments: AAAI 2023 (Oral); Code: https://github.com/sangminwoo/ActionMAE

arXiv:2210.07817 [pdf]

Discussion about Attacks and Defenses for Fair and Robust Recommendation System Design

Authors: Mirae Kim, Simon Woo

Abstract: Information has exploded on the Internet and mobile with the advent of the big data era. In particular, recommendation systems are widely used to help consumers who struggle to select the best products among such a large amount of information. However, recommendation systems are vulnerable to malicious user biases, such as fake reviews to promote or demote specific products, as well as attacks that steal personal information. Such biases and attacks compromise the fairness of the recommendation model and infringe the privacy of users and systems by distorting data. Recently, deep-learning collaborative filtering recommendation systems have been shown to be more vulnerable to this bias. In this position paper, we examine the effects of bias that cause various ethical and social issues, and discuss the need for designing robust recommendation systems for fairness and stability.

Submitted 28 September, 2022; originally announced October 2022.

arXiv:2210.02182 [pdf, other]

CFL-Net: Image Forgery Localization Using Contrastive Learning

Authors: Fahim Faisal Niloy, Kishor Kumar Bhaumik, Simon S. Woo

Abstract: Conventional forgery localizing methods usually rely on different forgery footprints such as JPEG artifacts, edge inconsistency, camera noise, etc., with cross-entropy loss to locate manipulated regions. However, these methods have the disadvantage of over-fitting and focusing on only a few specific forgery footprints. On the other hand, real-life manipulated images are generated via a wide variety of forgery operations and thus leave behind a wide variety of forgery footprints. Therefore, we need a more general approach for image forgery localization that can work well on a variety of forgery conditions. A key assumption underlying forged region localization is that there remains a difference of feature distribution between untampered and manipulated regions in each forged image sample, irrespective of the forgery type. In this paper, we aim to leverage this difference of feature distribution to aid in image forgery localization. Specifically, we use contrastive loss to learn a mapping into a feature space where the features of untampered and manipulated regions are well-separated for each image. Also, our method has the advantage of localizing manipulated regions without requiring any prior knowledge or assumption about the forgery type. We demonstrate that our work outperforms several existing methods on three benchmark image manipulation datasets. Code is available at https://github.com/niloy193/CFLNet.

Submitted 4 October, 2022; originally announced October 2022.

Comments: WACV 2023

arXiv:2209.12107 [pdf, other]

Valuation of Public Bus Electrification with Open Data

Authors: Upadhi Vijay, Soomin Woo, Scott J. Moura, Akshat Jain, David Rodriguez, Sergio Gambacorta, Giuseppe Ferrara, Luigi Lanuzza, Christian Zulberti, Erika Mellekas, Carlo Papa

Abstract: This research provides a novel framework to estimate the economic, environmental, and social values of electrifying public transit buses, for cities across the world, based on open-source data. Electric buses are a compelling candidate to replace diesel buses for their environmental and social benefits. However, the state-of-the-art models to evaluate the value of bus electrification are limited in applicability because they require granular and bespoke data on bus operation that can be difficult to procure. Our valuation tool uses the General Transit Feed Specification, a standard data format used by transit agencies worldwide, to provide high-level guidance on developing a prioritization strategy for electrifying a bus fleet. We develop physics-informed machine learning models to evaluate the energy consumption, the carbon emissions, the health impacts, and the total cost of ownership for each transit route. We demonstrate the scalability of our tool with a case study of the bus lines in the Greater Boston and Milan metropolitan areas.

Submitted 24 September, 2022; originally announced September 2022.

arXiv:2208.14625 [pdf, other]

Temporal Flow Mask Attention for Open-Set Long-Tailed Recognition of Wild Animals in Camera-Trap Images

Authors: Jeongsoo Kim, Sangmin Woo, Byeongjun Park, Changick Kim

Abstract: Camera traps, unmanned observation devices, and deep learning-based image recognition systems have greatly reduced human effort in collecting and analyzing wildlife images. However, data collected via the above apparatus exhibit 1) long-tailed and 2) open-ended distribution problems. To tackle the open-set long-tailed recognition problem, we propose the Temporal Flow Mask Attention Network that comprises three key building blocks: 1) an optical flow module, 2) an attention residual module, and 3) a meta-embedding classifier. We extract temporal features of sequential frames using the optical flow module and learn informative representations using attention residual blocks. Moreover, we show that applying the meta-embedding technique boosts the performance of the method in open-set long-tailed recognition. We apply this method to a Korean Demilitarized Zone (DMZ) dataset. We conduct extensive experiments, and quantitative and qualitative analyses, to show that our method effectively tackles the open-set long-tailed recognition problem while being robust to unknown classes.

Submitted 31 August, 2022; originally announced August 2022.

Comments: ICIP 2022

arXiv:2208.11314 [pdf, other]

Modality Mixer for Multi-modal Action Recognition

Authors: Sumin Lee, Sangmin Woo, Yeonju Park, Muhammad Adi Nugroho, Changick Kim

Abstract: In multi-modal action recognition, it is important to consider not only the complementary nature of different modalities but also global action content. In this paper, we propose a novel network, named Modality Mixer (M-Mixer) network, to leverage complementary information across modalities and temporal context of an action for multi-modal action recognition. We also introduce a simple yet effective recurrent unit, called Multi-modal Contextualization Unit (MCU), which is a core component of M-Mixer. Our MCU temporally encodes a sequence of one modality (e.g., RGB) with action content features of other modalities (e.g., depth, IR). This process encourages M-Mixer to exploit global action content and also to supplement complementary information of other modalities. As a result, our proposed method outperforms state-of-the-art methods on NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. Moreover, we demonstrate the effectiveness of M-Mixer by conducting comprehensive ablation studies.

Submitted 21 February, 2023; v1 submitted 24 August, 2022; originally announced August 2022.

Comments: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023

arXiv:2208.11264 [pdf, other]

doi 10.1145/3511808.3557073

Towards an Awareness of Time Series Anomaly Detection Models' Adversarial Vulnerability

Authors: Shahroz Tariq, Binh M. Le, Simon S. Woo

Abstract: Time series anomaly detection is extensively studied in statistics, economics, and computer science. Over the years, numerous methods have been proposed for time series anomaly detection using deep learning-based methods. Many of these methods demonstrate state-of-the-art performance on benchmark datasets, giving the false impression that these systems are robust and deployable in many practical and industrial real-world scenarios. In this paper, we demonstrate that the performance of state-of-the-art anomaly detection methods is degraded substantially by adding only small adversarial perturbations to the sensor data. We use different scoring metrics such as prediction errors, anomaly, and classification scores over several public and private datasets ranging from aerospace applications, server machines, to cyber-physical systems in power plants. Under well-known adversarial attacks from Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) methods, we demonstrate that state-of-the-art deep neural networks (DNNs) and graph neural networks (GNNs) methods, which claim to be robust against anomalies and have been possibly integrated in real-life systems, have their performance drop to as low as 0%. To the best of our understanding, we demonstrate, for the first time, the vulnerabilities of anomaly detection systems against adversarial attacks. The overarching goal of this research is to raise awareness towards the adversarial vulnerabilities of time series anomaly detectors.

Submitted 23 August, 2022; originally announced August 2022.

Comments: Part of Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM '22)

arXiv:2208.01924 [pdf, other]

Per-Clip Video Object Segmentation

Authors: Kwanyong Park, Sanghyun Woo, Seoung Wug Oh, In So Kweon, Joon-Young Lee

Abstract: Recently, memory-based approaches show promising results on semi-supervised video object segmentation. These methods predict object masks frame-by-frame with the help of frequently updated memory of the previous mask. Different from this per-frame inference, we investigate an alternative perspective by treating video object segmentation as clip-wise mask propagation. In this per-clip inference scheme, we update the memory with an interval and simultaneously process a set of consecutive frames (i.e. clip) between the memory updates. The scheme provides two potential benefits: accuracy gain by clip-level optimization and efficiency gain by parallel computation of multiple frames. To this end, we propose a new method tailored for the per-clip inference. Specifically, we first introduce a clip-wise operation to refine the features based on intra-clip correlation. In addition, we employ a progressive matching mechanism for efficient information-passing within a clip. With the synergy of two modules and a newly proposed per-clip based training, our network achieves state-of-the-art performance on Youtube-VOS 2018/2019 val (84.6% and 84.6%) and DAVIS 2016/2017 val (91.9% and 86.1%). Furthermore, our model shows a great speed-accuracy trade-off with varying memory update intervals, which leads to huge flexibility.

Submitted 3 August, 2022; originally announced August 2022.

Comments: CVPR 2022; Code is available at https://github.com/pkyong95/PCVOS

arXiv:2208.00110 [pdf, other]

doi 10.1109/DSN53405.2022.00043

L2Fuzz: Discovering Bluetooth L2CAP Vulnerabilities Using Stateful Fuzz Testing

Authors: Haram Park, Carlos Kayembe Nkuba, Seunghoon Woo, Heejo Lee

Abstract: Bluetooth Basic Rate/Enhanced Data Rate (BR/EDR) is a wireless technology used in billions of devices. Recently, several Bluetooth fuzzing studies have been conducted to detect vulnerabilities in Bluetooth devices, but they fall short of effectively generating malformed packets. In this paper, we propose L2FUZZ, a stateful fuzzer to detect vulnerabilities in the Bluetooth BR/EDR Logical Link Control and Adaptation Protocol (L2CAP) layer. By selecting valid commands for each state and mutating only the core fields of packets, L2FUZZ can generate valid malformed packets that are less likely to be rejected by the target device. Our experimental results confirmed that: (1) L2FUZZ generates up to 46 times more malformed packets with a much lower packet rejection ratio compared to the existing techniques, and (2) L2FUZZ detected five zero-day vulnerabilities from eight real-world Bluetooth devices.

Submitted 29 July, 2022; originally announced August 2022.

Comments: Updated version (2022.07.30)

Journal ref: 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

arXiv:2206.12132 [pdf, other]

doi 10.21437/Interspeech.2022-46

SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech

Authors: Hyunjae Cho, Wonbin Jung, Junhyeok Lee, Sang Hoon Woo

Abstract: In this paper, we present SANE-TTS, a stable and natural end-to-end multilingual TTS model. Because of the difficulty of obtaining a multilingual corpus for a given speaker, training a multilingual TTS model with monolingual corpora is unavoidable. We introduce a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, as well as domain adversarial training, which is applied in other multilingual TTS models. Furthermore, by adding the speaker regularization loss, replacing the speaker embedding with a zero vector in the duration predictor stabilizes cross-lingual inference. With this replacement, our model generates speech with moderate rhythm regardless of the source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS achieves a naturalness score above 3.80 in both cross-lingual and intralingual synthesis, where the ground truth score is 3.99. Also, SANE-TTS maintains speaker similarity close to that of ground truth even in cross-lingual inference. Audio samples are available on our web page.

Submitted 24 June, 2022; originally announced June 2022.

Comments: Accepted to Interspeech 2022

arXiv:2205.06421 [pdf, other]

Talking Face Generation with Multilingual TTS

Authors: Hyoung-Kyu Song, Sang Hoon Woo, Junhyeok Lee, Seungmin Yang, Hyunjae Cho, Youseong Lee, Dongho Choi, Kang-wook Kim

Abstract: In this work, we propose a joint system combining a talking face generation system with a text-to-speech system that can generate multilingual talking face videos from only the text input. Our system can synthesize natural multilingual speeches while maintaining the vocal identity of the speaker, as well as lip movements synchronized to the synthesized speech. We demonstrate the generalization capabilities of our system by selecting four languages (Korean, English, Japanese, and Chinese) each from a different language family. We also compare the outputs of our talking face generation model to outputs of a prior work that claims multilingual support. For our demo, we add a translation API to the preprocessing stage and present it in the form of a neural dubber so that users can utilize the multilingual property of our system more easily.

Submitted 12 May, 2022; originally announced May 2022.

Comments: Accepted to CVPR Demo Track (2022)

arXiv:2202.12520 [pdf, other]

doi 10.1126/sciadv.abq4947

Cathodoluminescence excitation spectroscopy: nanoscale imaging of excitation pathways

Authors: Nadezda Varkentina, Yves Auad, Steffi Y. Woo, Alberto Zobelli, Jean-Denis Blazit, Xiaoyan Li, Marcel Tencé, Kenji Watanabe, Takashi Taniguchi, Odile Stéphan, Mathieu Kociak, Luiz H. G. Tizei

Abstract: Following the lifespan of optical excitations from their creation to decay into photons is crucial in understanding materials' optical properties. Macroscopically, techniques such as photoluminescence excitation spectroscopy provide unique information on the photophysics of materials with applications as diverse as quantum optics or photovoltaics. Materials' excitation and emission pathways are affected by nanometer scale variations directly impacting device performance. However, they cannot be directly accessed, despite techniques, such as optical spectroscopies with free electrons, having the relevant spatial, spectral or time resolution. Here, we explore optical excitation creation and decay in two representative optical devices: plasmonic nanoparticles and luminescent 2D layers. The analysis of the energy lost by an exciting electron that is coincident in time with a visible-UV photon unveils the decay pathways from excitation towards light emission. This is demonstrated for phase-locked interactions, such as in localized surface plasmons, and non-phase-locked ones, such as the light emission by individual point defects. This newly developed cathodoluminescence excitation spectroscopy images energy transfer pathways at the nanometer scale. It widens the toolset available to explore nanoscale materials.

Submitted 8 July, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

arXiv:2202.11359

Deepfake Detection for Facial Images with Facemasks

Authors: Donggeun Ko, Sangjun Lee, Jinyong Park, Saebyeol Shin, Donghee Hong, Simon S. Woo

Abstract: Hyper-realistic face image generation and manipulation have given rise to numerous unethical social issues, e.g., invasion of privacy, threat of security, and malicious political maneuvering, which resulted in the development of recent deepfake detection methods with the rising demands of deepfake forensics. Proposed deepfake detection methods to date have shown remarkable detection performance and robustness. However, none of the suggested deepfake detection methods assessed the performance on deepfakes with facemasks during the pandemic crisis after the outbreak of COVID-19. In this paper, we thoroughly evaluate the performance of state-of-the-art deepfake detection models on deepfakes with facemasks. Also, we propose two approaches to enhance masked deepfake detection: face-patch and face-crop. The experimental evaluations of both methods are assessed through the baseline deepfake detection models on various deepfake datasets. Our extensive experiments show that, among the two methods, face-crop performs better than face-patch, and could be a training method for deepfake detection models to detect fake faces with facemasks in the real world.

Submitted 23 February, 2022; originally announced February 2022.

Comments: This submission has been removed by arXiv administrators because the submitter did not have the authority to grant the license at the time of submission

arXiv:2202.04483 [pdf, other]

doi 10.1103/PhysRevMaterials.6.074005

Substrate influence on transition metal dichalcogenide monolayer exciton absorption linewidth broadening

Authors: Fuhui Shao, Steffi Y. Woo, Nianjheng Wu, Robert Schneider, Andrew J. Mayne, Steffen Michaelis de Vasconcellos, Ashish Arora, Benjamin J. Carey, Johann A. Preuß, Noémie Bonnet, Cecilia Mattevi, Kenji Watanabe, Takashi Taniguchi, Zhichuan Niu, Rudolf Bratschitsch, Luiz H. G. Tizei

Abstract: The excitonic states of transition metal dichalcogenide (TMD) monolayers are heavily influenced by their external dielectric environment based on the substrate used. In this work, various wide bandgap dielectric materials, namely hexagonal boron nitride (\textit{h}-BN) and amorphous silicon nitride (Si$_3$N$_4$), under different configurations as support or encapsulation material for WS$_2$ monolayers are investigated to disentangle the factors contributing to inhomogeneous broadening of exciton absorption lines in TMDs using electron energy loss spectroscopy (EELS) in a scanning transmission electron microscope (STEM). In addition, monolayer roughness in each configuration was determined from tilt series of electron diffraction patterns by assessing the broadening of diffraction spots by comparison with simulations. From our experiments, the main factors that play a role in linewidth broadening can be classified in increasing order of importance by: monolayer roughness, surface cleanliness, and substrate-induced charge trapping. Furthermore, because high-energy electrons are used as a probe, electron beam-induced damage on bare TMD monolayer is also revealed to be responsible for irreversible linewidth increases. \textit{h}-BN not only provides clean surfaces of TMD monolayer, and minimal charge disorder, but can also protect the TMD from irradiation damage. This work provides a better understanding of the mechanisms by which \textit{h}-BN remains, to date, the most compatible material for 2D material encapsulation, facilitating the realization of intrinsic material properties to their full potential.

Submitted 9 February, 2022; originally announced February 2022.

arXiv:2202.02910 [pdf, other]

doi 10.1103/PhysRevB.105.205126

Semiclassical magnetotransport including the effects of the Berry curvature and Lorentz force

Authors: Seungchan Woo, Brett Min, Hongki Min

Abstract: In topological semimetals and insulators, negative longitudinal magnetoresistance and angle-dependent planar Hall effect have been reported arising from the Berry curvature. Using the Boltzmann transport theory, we present a closed-form expression for the nonequilibrium distribution function which includes both the effects of the Berry curvature and Lorentz force. Using this formulation, we obtain analytical expressions for conductivity and resistivity tensors in Weyl semimetals demonstrating a non-monotonic field dependence arising from the competition between the two effects.

Submitted 30 May, 2022; v1 submitted 6 February, 2022; originally announced February 2022.

Comments: 14 pages, 3 figures

Journal ref: Phys. Rev. B 105, 205126 (2022)

arXiv:2201.10168 [pdf, other]

Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos

Authors: Sangmin Woo, Jinyoung Park, Inyong Koo, Sumin Lee, Minki Jeong, Changick Kim

Abstract: Natural Language Video Grounding (NLVG) aims to localize time segments in an untrimmed video according to sentence queries. In this work, we present a new paradigm named Explore-And-Match for NLVG that seamlessly unifies the strengths of two streams of NLVG methods: proposal-free and proposal-based; the former explores the search space to find time segments directly, and the latter matches the predefined time segments with ground truths. To achieve this, we formulate NLVG as a set prediction problem and design an end-to-end trainable Language Video Transformer (LVTR) that can enjoy two favorable properties, which are rich contextualization power and parallel decoding. We train LVTR with two losses. First, temporal localization loss allows time segments of all queries to regress targets (explore). Second, set guidance loss couples every query with its respective target (match). To our surprise, we found that the training schedule shows a divide-and-conquer-like pattern: time segments are first diversified regardless of the target, then coupled with each target, and fine-tuned to the target again. Moreover, LVTR is highly efficient and effective: it infers faster than previous baselines (by 2X or more) and sets competitive results on two NLVG benchmarks (ActivityCaptions and Charades-STA). Codes are available at https://github.com/sangminwoo/Explore-And-Match.

Submitted 4 August, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

Comments: Code: https://github.com/sangminwoo/Explore-And-Match

arXiv:2201.07394 [pdf, other]

KappaFace: Adaptive Additive Angular Margin Loss for Deep Face Recognition

Authors: Chingis Oinar, Binh M. Le, Simon S. Woo

Abstract: Feature learning is a widely used method for large-scale face recognition. Recently, large-margin softmax loss methods have demonstrated significant enhancements in deep face recognition. These methods propose fixed positive margins in order to enforce intra-class compactness and inter-class diversity. However, the majority of the proposed methods do not consider the class imbalance issue, which is a major challenge in practice for developing deep face recognition models. We hypothesize that it significantly affects the generalization ability of the deep face models. Inspired by this observation, we introduce a novel adaptive strategy, called KappaFace, to modulate the relative importance based on class difficulty and imbalance. With the support of the von Mises-Fisher distribution, our proposed KappaFace loss can intensify the margin's magnitude for hard learning or low concentration classes while relaxing it for counter classes. Experiments conducted on popular facial benchmarks demonstrate that our proposed method achieves superior performance to the state-of-the-art.

Submitted 6 December, 2023; v1 submitted 18 January, 2022; originally announced January 2022.

arXiv:2201.06026 [pdf, other]

Toward Among-Device AI from On-Device AI with Stream Pipelines

Authors: MyungJoo Ham, Sangjung Woo, Jaeyun Jung, Wook Song, Gichan Jang, Yongjoo Ahn, Hyoung Joo Ahn

Abstract: Modern consumer electronic devices often provide intelligence services with deep neural networks. We have started migrating the computing locations of intelligence services from cloud servers (traditional AI systems) to the corresponding devices (on-device AI systems). On-device AI systems generally have the advantages of preserving privacy, removing network latency, and saving cloud costs. With the emergence of on-device AI systems having relatively low computing power, the inconsistent and varying hardware resources and capabilities pose difficulties. Authors' affiliation has started applying a stream pipeline framework, NNStreamer, for on-device AI systems, saving developmental costs and hardware resources and improving performance. We want to expand the types of devices and applications with on-device AI services products of both the affiliation and second/third parties. We also want to make each AI service atomic, re-deployable, and shared among connected devices of arbitrary vendors; we now have yet another requirement introduced as it always has been. The new requirement of "among-device AI" includes connectivity between AI pipelines so that they may share computing resources and hardware capabilities across a wide range of devices regardless of vendors and manufacturers. We propose extensions of the stream pipeline framework, NNStreamer, for on-device AI so that NNStreamer may provide among-device AI capability. This work is a Linux Foundation (LF AI and Data) open source project accepting contributions from the general public.

Submitted 16 January, 2022; originally announced January 2022.

Comments: to appear in ICSE 2022 SEIP (preprint)

arXiv:2112.12001 [pdf, other]

DA-FDFtNet: Dual Attention Fake Detection Fine-tuning Network to Detect Various AI-Generated Fake Images

Authors: Young Oh Bang, Simon S. Woo

Abstract: Due to the advancement of Generative Adversarial Networks (GAN), Autoencoders, and other AI technologies, it has become much easier to create fake images such as "Deepfakes". More recent research has introduced few-shot learning, which uses a small amount of training data to produce fake images and videos more effectively. Therefore, the ease of generating manipulated images and the difficulty of distinguishing those images can pose a serious threat to our society, such as propagating fake information. However, detecting realistic fake images generated by the latest AI technology is challenging due to the reasons mentioned above. In this work, we propose Dual Attention Fake Detection Fine-tuning Network (DA-FDFtNet) to detect the manipulated fake face images from the real face data. Our DA-FDFtNet integrates the pre-trained model with Fine-Tune Transformer, MBblockV3, and a channel attention module to improve the performance and robustness across different types of fake images. In particular, Fine-Tune Transformer consists of multiple image-based self-attention modules and a down-sampling layer. The channel attention module is also connected with the pre-trained model to capture the fake images' feature space. We experiment with our DA-FDFtNet on the FaceForensics++ dataset and various GAN-generated datasets, and we show that our approach outperforms the previous baseline models.

Submitted 22 December, 2021; originally announced December 2021.

arXiv:2112.08050 [pdf, other]

Exploring the Asynchronous of the Frequency Spectra of GAN-generated Facial Images

Authors: Binh M. Le, Simon S. Woo

Abstract: The rapid progression of Generative Adversarial Networks (GANs) has raised a concern of their misuse for malicious purposes, especially in creating fake face images. Although many proposed methods succeed in detecting GAN-based synthetic images, they are still limited by the need for large quantities of the training fake image dataset and challenges for the detector's generalizability to unknown facial images. In this paper, we propose a new approach that explores the asynchronous frequency spectra of color channels, which is simple but effective for training both unsupervised and supervised learning models to distinguish GAN-based synthetic images. We further investigate the transferability of a training model that learns from our suggested features in one source domain and validates on other target domains with prior knowledge of the features' distribution. Our experimental results show that the discrepancy of spectra in the frequency domain is a practical artifact to effectively detect various types of GAN-based generated images.

Submitted 15 December, 2021; originally announced December 2021.

Comments: International Workshop on Safety and Security of Deep Learning IJCAI, 2021

arXiv:2112.03553 [pdf, other]

ADD: Frequency Attention and Multi-View based Knowledge Distillation to Detect Low-Quality Compressed Deepfake Images

Authors: Binh M. Le, Simon S. Woo

Abstract: Despite significant advancements of deep learning-based forgery detectors for distinguishing manipulated deepfake images, most detection approaches suffer from moderate to significant performance degradation with low-quality compressed deepfake images. Because of the limited information in low-quality images, detecting low-quality deepfake remains an important challenge. In this work, we apply frequency domain learning and optimal transport theory in knowledge distillation (KD) to specifically improve the detection of low-quality compressed deepfake images. We explore transfer learning capability in KD to enable a student network to learn discriminative features from low-quality images effectively. In particular, we propose the Attention-based Deepfake detection Distiller (ADD), which consists of two novel distillations: 1) frequency attention distillation that effectively retrieves the removed high-frequency components in the student network, and 2) multi-view attention distillation that creates multiple attention vectors by slicing the teacher's and student's tensors under different views to transfer the teacher tensor's distribution to the student more efficiently. Our extensive experimental results demonstrate that our approach outperforms state-of-the-art baselines in detecting low-quality compressed deepfake images.

Submitted 7 December, 2021; originally announced December 2021.

Journal ref: Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

arXiv:2110.04111 [pdf, other]

Discover, Hallucinate, and Adapt: Open Compound Domain Adaptation for Semantic Segmentation

Authors: KwanYong Park, Sanghyun Woo, Inkyu Shin, In So Kweon

Abstract: Unsupervised domain adaptation (UDA) for semantic segmentation has been attracting attention recently, as it could be beneficial for various label-scarce real-world scenarios (e.g., robot control, autonomous driving, medical imaging, etc.). Despite the significant progress in this field, current works mainly focus on a single-source single-target setting, which cannot handle more practical settings of multiple targets or even unseen targets. In this paper, we investigate open compound domain adaptation (OCDA), which deals with mixed and novel situations at the same time, for semantic segmentation. We present a novel framework based on three main design principles: discover, hallucinate, and adapt. The scheme first clusters compound target data based on style, discovering multiple latent domains (discover). Then, it hallucinates multiple latent target domains in source by using image-translation (hallucinate). This step ensures the latent domains in the source and the target to be paired. Finally, target-to-source alignment is learned separately between domains (adapt). In high-level, our solution replaces a hard OCDA problem with much easier multiple UDA problems. We evaluate our solution on standard benchmark GTA to C-driving, and achieved new state-of-the-art results.

Submitted 8 October, 2021; originally announced October 2021.

Comments: NeurIPS 2020

arXiv:2109.09456 [pdf]

Multi-modal Matching Problem of Shared Mobility

Authors: Soomin Woo

Abstract: Rideshare is one way to share mobility in transportation without increasing traffic demand, where travel mobility and usage of vehicle capacity can be improved. However, current literature on rideshare has allowed only one-modal trips and may be limited in the matching efficiency, especially when there is a large gap between the supply and demand of mobility. Therefore, the objectives of this paper are first to develop a multi-modal matching framework of shared mobility with public transportation to maximize the performance of a rideshare system, and second to evaluate the effect of the public transportation and of the schedule flexibility on the matching efficiency. To fulfill the first objective, a multi-modal matching framework is developed to allow rideshare with both private and public vehicles with detailed design of detour, using Genetic Algorithm. Also for the second objective, the effects of public transportation and schedule flexibility are evaluated with a simplified network of Sioux Falls. The results show that public transportation helps the match rate slightly at a low supply of private vehicle, but this must be evaluated for practical implementation as different cities may bring different results. Also, a larger schedule flexibility helps greatly in increasing match rate even at a lower supply level. As well, the planning subject of time schedule is benefited more with larger schedule flexibility, in this paper the drivers, on the matching efficiency. Moreover, a rideshare system with private vehicles outperforms a public transportation system, possibly due to the rigid route of public transportation that takes no detour burden. This confirms the need for a flexible design of sharing mobility, as can be fulfilled with the multi-modal matching framework developed in this research.

Submitted 12 August, 2021; originally announced September 2021.

arXiv:2109.02993 [pdf, other]

doi 10.1145/3476099.3484315

Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors

Authors: Hasam Khalid, Minha Kim, Shahroz Tariq, Simon S. Woo

Abstract: Significant advancements made in the generation of deepfakes have caused security and privacy issues. Attackers can easily impersonate a person's identity in an image by replacing his face with the target person's face. Moreover, a new domain of cloning human voices using deep-learning technologies is also emerging. Now, an attacker can generate realistic cloned voices of humans using only a few seconds of audio of the target person. With the emerging threat of potential harm deepfakes can cause, researchers have proposed deepfake detection methods. However, they only focus on detecting a single modality, i.e., either video or audio. On the other hand, to develop a good deepfake detector that can cope with the recent advancements in deepfake generation, we need to have a detector that can detect deepfakes of multiple modalities, i.e., videos and audios. To build such a detector, we need a dataset that contains video and respective audio deepfakes. We were able to find a most recent deepfake dataset, Audio-Video Multimodal Deepfake Detection Dataset (FakeAVCeleb), that contains not only deepfake videos but synthesized fake audios as well. We used this multimodal deepfake dataset and performed detailed baseline experiments using state-of-the-art unimodal, ensemble-based, and multimodal detection methods to evaluate it. We conclude through detailed experimentation that unimodals, addressing only a single modality, video or audio, do not perform well compared to ensemble-based methods. Whereas purely multimodal-based baselines provide the worst performance.

Submitted 7 September, 2021; originally announced September 2021.

Comments: 2 Figures, 2 Tables, Accepted for publication at the 1st Workshop on Synthetic Multimedia - Audiovisual Deepfake Generation and Detection (ADGD '21) at ACM MM 2021

ACM Class: I.4.9; I.5.4

arXiv:2109.01486 [pdf, other]

Studying the Effects of Self-Attention for Medical Image Analysis

Authors: Adrit Rao, Jongchan Park, Sanghyun Woo, Joon-Young Lee, Oliver Aalami

Abstract: When the trained physician interprets medical images, they understand the clinical importance of visual features. By applying cognitive attention, they apply greater focus onto clinically relevant regions while disregarding unnecessary features. The use of computer vision to automate the classification of medical images is widely studied. However, the standard convolutional neural network (CNN) does not necessarily employ subconscious feature relevancy evaluation techniques similar to the trained medical specialist and evaluates features more generally. Self-attention mechanisms enable CNNs to focus more on semantically important regions or aggregated relevant context with long-range dependencies. By using attention, medical image analysis systems can potentially become more robust by focusing on more important clinical feature regions. In this paper, we provide a comprehensive comparison of various state-of-the-art self-attention mechanisms across multiple medical image analysis tasks. Through both quantitative and qualitative evaluations along with a clinical user-centric survey study, we aim to provide a deeper understanding of the effects of self-attention in medical computer vision tasks.

Submitted 2 September, 2021; originally announced September 2021.

Comments: ICCV 2021 CVAMD

arXiv:2108.09954 [pdf]

Pulse-Width Modulation Neuron Implemented by Single Positive-Feedback Device

Authors: Sung Yun Woo, Dongseok Kwon, Byung-Gook Park, Jong-Ho Lee, Jong-Ho Bae

Abstract: Positive-feedback (PF) device and its operation scheme to implement pulse width modulation (PWM) function was proposed and demonstrated, and the device operation mechanism for implementing PWM function was analyzed. By adjusting the amount of the charge stored in the n- floating body (Qn), the potential of the floating body linearly changes with time. When Qn reaches to a threshold value (Qth), the PF device turns on abruptly. From the linear time-varying property of Qn and the gate bias dependency of Qth, fully functionable PWM neuron properties including voltage to pulse width conversion and hard-sigmoid activation function were successfully obtained from a single PF device. A PWM neuron can be implemented by using a single PF device, thus it is beneficial to extremely reduce the area of a PWM neuron circuit than the previously reported one.

Submitted 23 August, 2021; originally announced August 2021.

arXiv:2108.05570 [pdf, other]

LabOR: Labeling Only if Required for Domain Adaptive Semantic Segmentation

Authors: Inkyu Shin, Dong-jin Kim, Jae Won Cho, Sanghyun Woo, Kwanyong Park, In So Kweon

Abstract: Unsupervised Domain Adaptation (UDA) for semantic segmentation has been actively studied to mitigate the domain gap between label-rich source data and unlabeled target data. Despite these efforts, UDA still has a long way to go to reach the fully supervised performance. To this end, we propose a Labeling Only if Required strategy, LabOR, where we introduce a human-in-the-loop approach to adaptively give scarce labels to points that a UDA model is uncertain about. In order to find the uncertain points, we generate an inconsistency mask using the proposed adaptive pixel selector and we label these segment-based regions to achieve near supervised performance with only a small fraction (about 2.2%) ground truth points, which we call "Segment based Pixel-Labeling (SPL)". To further reduce the efforts of the human annotator, we also propose "Point-based Pixel-Labeling (PPL)", which finds the most representative points for labeling within the generated inconsistency mask. This reduces efforts from 2.2% segment label to 40 points label while minimizing performance degradation. Through extensive experimentation, we show the advantages of this new framework for domain adaptive semantic segmentation while minimizing human labor costs.

Submitted 12 August, 2021; originally announced August 2021.

Comments: Accepted to ICCV 2021 (Oral)

arXiv:2108.05530 [pdf, other]

Flow-Aware Platoon Formation of Connected Automated Vehicles

Authors: Soomin Woo, Alexander Skabardonis

Abstract: Connected Automated Vehicles (CAVs) bring promise of increasing traffic capacity and energy efficiency by forming platoons with short headways on the road. However at low CAV penetration, the capacity gain will be small because the CAVs that randomly enter the road will be sparsely distributed, diminishing the probability of forming long platoons. Many researchers propose to solve this issue by platoon organization strategies, where the CAVs search for other CAVs on the road and change lanes if necessary to form longer platoons. However, the current literature does not analyze a potential risk of platoon organization in disrupting the flow and reducing the capacity by inducing more lane changes. In this research, we use driving model of Cooperative Adaptive Cruise Control (CACC) vehicles and human-driven vehicles that are validated with field experiments and find that platoon organization can indeed drop the capacity with more lane changes. But when the traffic demand is well below capacity, platoon organization forms longer CAV platoons without reducing the flow. Based on this finding, we develop the Flow-Aware platoon organization strategy, where the CAVs perform platoon organization conditionally on the local traffic state, i.e., a low flow and a high speed. We simulate the Flow-Aware platoon organization on a realistic freeway network and show that the CAVs successfully form longer platoons, while ensuring a maximal traffic flow.

Submitted 12 August, 2021; originally announced August 2021.

arXiv:2108.05080 [pdf, other]

FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset

Authors: Hasam Khalid, Shahroz Tariq, Minha Kim, Simon S. Woo

Abstract: While significant advancements have been made in the generation of deepfakes using deep learning technologies, its misuse is a well-known issue now. Deepfakes can cause severe security and privacy issues as they can be used to impersonate a person's identity in a video by replacing his/her face with another person's face. Recently, a new problem of generating synthesized human voice of a person is emerging, where AI-based deep learning models can synthesize any person's voice requiring just a few seconds of audio. With the emerging threat of impersonation attacks using deepfake audios and videos, a new generation of deepfake detectors is needed to focus on both video and audio collectively. To develop a competent deepfake detector, a large amount of high-quality data is typically required to capture real-world (or practical) scenarios. Existing deepfake datasets either contain deepfake videos or audios, which are racially biased as well. As a result, it is critical to develop a high-quality video and audio deepfake dataset that can be used to detect both audio and video deepfakes simultaneously. To fill this gap, we propose a novel Audio-Video Deepfake dataset, FakeAVCeleb, which contains not only deepfake videos but also respective synthesized lip-synced fake audios. We generate this dataset using the most popular deepfake generation methods. We selected real YouTube videos of celebrities with four ethnic backgrounds to develop a more realistic multimodal dataset that addresses racial bias, and further help develop multimodal deepfake detectors. We performed several experiments using state-of-the-art detection methods to evaluate our deepfake dataset and demonstrate the challenges and usefulness of our multimodal Audio-Video deepfake dataset.

Submitted 1 March, 2022; v1 submitted 11 August, 2021; originally announced August 2021.

Comments: Part of Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021)

ACM Class: I.4.9; I.5.4

arXiv:2107.12655 [pdf, other]

doi 10.1016/j.patcog.2023.109800

MKConv: Multidimensional Feature Representation for Point Cloud Analysis

Authors: Sungmin Woo, Dogyoon Lee, Sangwon Hwang, Woojin Kim, Sangyoun Lee

Abstract: Despite the remarkable success of deep learning, an optimal convolution operation on point clouds remains elusive owing to their irregular data structure. Existing methods mainly focus on designing an effective continuous kernel function that can handle an arbitrary point in continuous space. Various approaches exhibiting high performance have been proposed, but we observe that the standard pointwise feature is represented by 1D channels and can become more informative when its representation involves additional spatial feature dimensions. In this paper, we present Multidimensional Kernel Convolution (MKConv), a novel convolution operator that learns to transform the point feature representation from a vector to a multidimensional matrix. Unlike standard point convolution, MKConv proceeds via two steps. (i) It first activates the spatial dimensions of local feature representation by exploiting multidimensional kernel weights. These spatially expanded features can represent their embedded information through spatial correlation as well as channel correlation in feature space, carrying more detailed local structure information. (ii) Then, discrete convolutions are applied to the multidimensional features which can be regarded as a grid-structured matrix. In this way, we can utilize the discrete convolutions for point cloud data without voxelization that suffers from information loss. Furthermore, we propose a spatial attention module, Multidimensional Local Attention (MLA), to provide comprehensive structure awareness within the local point set by reweighting the spatial feature dimensions. We demonstrate that MKConv has excellent applicability to point cloud processing tasks including object classification, object part segmentation, and scene semantic segmentation with superior results.

Submitted 17 July, 2023; v1 submitted 27 July, 2021; originally announced July 2021.

Comments: Accepted by Pattern Recognition 2023

Journal ref: Pattern Recognition 143C (2023) 109800

arXiv:2107.11052 [pdf, other]

Unsupervised Domain Adaptation for Video Semantic Segmentation

Authors: Inkyu Shin, Kwanyong Park, Sanghyun Woo, In So Kweon

Abstract: Unsupervised Domain Adaptation for semantic segmentation has gained immense popularity since it can transfer knowledge from simulation to real (Sim2Real) by largely cutting out the laborious per pixel labeling efforts at real. In this work, we present a new video extension of this task, namely Unsupervised Domain Adaptation for Video Semantic Segmentation. As it became easy to obtain large-scale video labels through simulation, we believe attempting to maximize Sim2Real knowledge transferability is one of the promising directions for resolving the fundamental data-hungry issue in the video. To tackle this new problem, we present a novel two-phase adaptation scheme. In the first step, we exhaustively distill source domain knowledge using supervised loss functions. Simultaneously, video adversarial training (VAT) is employed to align the features from source to target utilizing video context. In the second step, we apply video self-training (VST), focusing only on the target data. To construct robust pseudo labels, we exploit the temporal information in the video, which has been rarely explored in the previous image-based self-training approaches. We set strong baseline scores on 'VIPER to CityscapeVPS' adaptation scenario. We show that our proposals significantly outperform previous image-based UDA methods both on image-level (mIoU) and video-level (VPQ) evaluation metrics.

Submitted 13 September, 2021; v1 submitted 23 July, 2021; originally announced July 2021.

arXiv:2107.07154 [pdf, other]

What and When to Look?: Temporal Span Proposal Network for Video Relation Detection

Authors: Sangmin Woo, Junhyug Noh, Kangil Kim

Abstract: Identifying relations between objects is central to understanding the scene. While several works have been proposed for relation modeling in the image domain, there have been many constraints in the video domain due to challenging dynamics of spatio-temporal interactions (e.g., between which objects are there an interaction? when do relations start and end?). To date, two representative methods have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out limitations of these methods and propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look: it sparsifies relation search space by scoring relationness of object pair, i.e., measuring how probable a relation exists. TSPN tells when to look: it simultaneously predicts start-end timestamps (i.e., temporal spans) and categories of all possible relations by utilizing full video context. These two designs enable a win-win scenario: it accelerates training by 2X or more than existing methods and achieves competitive performance on two VidVRD benchmarks (ImageNet-VidVRD and VidOR). Moreover, comprehensive ablative experiments demonstrate the effectiveness of our approach. Codes are available at https://github.com/sangminwoo/Temporal-Span-Proposal-Network-VidVRD.

Submitted 5 October, 2022; v1 submitted 15 July, 2021; originally announced July 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2107.02408 [pdf, other]

doi 10.1145/3474085.3475535

CoReD: Generalizing Fake Media Detection with Continual Representation using Distillation

Authors: Minha Kim, Shahroz Tariq, Simon S. Woo

Abstract: Over the last few decades, artificial intelligence research has made tremendous strides, but it still heavily relies on fixed datasets in stationary environments. Continual learning is a growing field of research that examines how AI systems can learn sequentially from a continuous stream of linked data in the same way that biological systems do. Simultaneously, fake media such as deepfakes and synthetic face images have emerged as significant to current multimedia technologies. Recently, numerous methods have been proposed that can detect deepfakes with high accuracy. However, they suffer significantly due to their reliance on fixed datasets in limited evaluation settings. Therefore, in this work, we apply continuous learning to neural networks' learning dynamics, emphasizing its potential to increase data efficiency significantly. We propose Continual Representation using Distillation (CoReD) method that employs the concept of Continual Learning (CL), Representation Learning (RL), and Knowledge Distillation (KD). We design CoReD to perform sequential domain adaptation tasks on new deepfake and GAN-generated synthetic face datasets, while effectively minimizing the catastrophic forgetting in a teacher-student model setting. Our extensive experimental results demonstrate that our method is efficient at domain adaptation to detect low-quality deepfake videos and GAN-generated images from several datasets, outperforming the state-of-the-art baseline methods.

Submitted 5 August, 2021; v1 submitted 6 July, 2021; originally announced July 2021.

Comments: 13 pages, 7 Figures, 13 Tables, Accepted for publication in the 29th ACM International Conference on Multimedia (ACMMM '21)

ACM Class: I.4.9; I.5.4

arXiv:2106.09453 [pdf, other]

Learning to Associate Every Segment for Video Panoptic Segmentation

Authors: Sanghyun Woo, Dahun Kim, Joon-Young Lee, In So Kweon

Abstract: Temporal correspondence - linking pixels or objects across frames - is a fundamental supervisory signal for the video models. For the panoptic understanding of dynamic scenes, we further extend this concept to every segment. Specifically, we aim to learn coarse segment-level matching and fine pixel-level matching together. We implement this idea by designing two novel learning objectives. To validate our proposals, we adopt a deep siamese model and train the model to learn the temporal correspondence on two different levels (i.e., segment and pixel) along with the target task. At inference time, the model processes each frame independently without any extra computation and post-processing. We show that our per-frame inference model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets. Moreover, due to its high efficiency, the model runs in a fraction of time (3x) compared to the previous state-of-the-art approach.

Submitted 17 June, 2021; originally announced June 2021.

Comments: Accepted to CVPR2021

arXiv:2106.08543 [pdf, other]

doi 10.1109/TNNLS.2022.3159990

Tackling the Challenges in Scene Graph Generation with Local-to-Global Interactions

Authors: Sangmin Woo, Junhyug Noh, Kangil Kim

Abstract: In this work, we seek new insights into the underlying challenges of the Scene Graph Generation (SGG) task. Quantitative and qualitative analysis of the Visual Genome dataset implies -- 1) Ambiguity: even if inter-object relationships contain the same object (or predicate), they may not be visually or semantically similar, 2) Asymmetry: despite the inherently directional nature of relationships, direction was not well addressed in previous studies, and 3) Higher-order contexts: leveraging the identities of certain graph elements can help to generate accurate scene graphs. Motivated by the analysis, we design a novel SGG framework, Local-to-Global Interaction Networks (LOGIN). Locally, interactions extract the essence between three instances of subject, object, and background, while baking direction awareness into the network by explicitly constraining the input order of subject and object. Globally, interactions encode the contexts between every graph component (i.e., nodes and edges). Finally, Attract & Repel loss is utilized to fine-tune the distribution of predicate embeddings. By design, our framework enables predicting the scene graph in a bottom-up manner, leveraging the possible complementariness. To quantify how much LOGIN is aware of relational direction, a new diagnostic task called Bidirectional Relationship Classification (BRC) is also proposed. Experimental results demonstrate that LOGIN distinguishes relational direction more successfully than existing methods (in the BRC task), while showing state-of-the-art results on the Visual Genome benchmark (in the SGG task).

Submitted 12 April, 2022; v1 submitted 15 June, 2021; originally announced June 2021.

Comments: IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

arXiv:2105.13617 [pdf, other]

FReTAL: Generalizing Deepfake Detection using Knowledge Distillation and Representation Learning

Authors: Minha Kim, Shahroz Tariq, Simon S. Woo

Abstract: As GAN-based video and image manipulation technologies become more sophisticated and easily accessible, there is an urgent need for effective deepfake detection technologies. Moreover, various deepfake generation techniques have emerged over the past few years. While many deepfake detection methods have been proposed, their performance suffers from new types of deepfake methods on which they are not sufficiently trained. To detect new types of deepfakes, the model should learn from additional data without losing its prior knowledge about deepfakes (catastrophic forgetting), especially when new deepfakes are significantly different. In this work, we employ the Representation Learning (ReL) and Knowledge Distillation (KD) paradigms to introduce a transfer learning-based Feature Representation Transfer Adaptation Learning (FReTAL) method. We use FReTAL to perform domain adaptation tasks on new deepfake datasets while minimizing catastrophic forgetting. Our student model can quickly adapt to new types of deepfake by distilling knowledge from a pre-trained teacher model and applying transfer learning without using source domain data during domain adaptation. Through experiments on FaceForensics++ datasets, we demonstrate that FReTAL outperforms all baselines on the domain adaptation task with up to 86.97% accuracy on low-quality deepfakes.

Submitted 28 May, 2021; originally announced May 2021.

Comments: 12 pages, 2 figures, 5 tables, accepted for publication at the Workshop on Media Forensics 2021

ACM Class: I.4.9; I.5.4

arXiv:2105.06117 [pdf, other]

TAR: Generalized Forensic Framework to Detect Deepfakes using Weakly Supervised Learning

Authors: Sangyup Lee, Shahroz Tariq, Junyaup Kim, Simon S. Woo

Abstract: Deepfakes have become a critical social problem, and detecting them is of utmost importance. Also, deepfake generation methods are advancing, and it is becoming harder to detect. While many deepfake detection models can detect different types of deepfakes separately, they perform poorly on generalizing the detection performance over multiple types of deepfake. This motivates us to develop a generalized model to detect different types of deepfakes. Therefore, in this work, we introduce a practical digital forensic tool to detect different types of deepfakes simultaneously and propose Transfer learning-based Autoencoder with Residuals (TAR). The ultimate goal of our work is to develop a unified model to detect various types of deepfake videos with high accuracy, with only a small number of training samples that can work well in real-world settings. We develop an autoencoder-based detection model with Residual blocks and sequentially perform transfer learning to detect different types of deepfakes simultaneously. Our approach achieves a much higher generalized detection performance than the state-of-the-art methods on the FaceForensics++ dataset. In addition, we evaluate our model on 200 real-world Deepfake-in-the-Wild (DW) videos of 50 celebrities available on the Internet and achieve 89.49% zero-shot accuracy, which is significantly higher than the best baseline model (gaining 10.77%), demonstrating and validating the practicability of our approach.

Submitted 13 May, 2021; originally announced May 2021.

Comments: 16 pages, 3 figures, to be published in IFIP-SEC 2021
