
SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers

Authors: Mingrui Zhao, Yizhi Wang, Fenggen Yu, Changqing Zou, Ali Mahdavi-Amiri

Abstract: Shape abstraction is an important task for simplifying complex geometric structures while retaining essential features. Sweep surfaces, commonly found in human-made objects, aid in this process by effectively capturing and representing object geometry, thereby facilitating abstraction. In this paper, we introduce SweepNet, a novel approach to shape abstraction through sweep surfaces. We propose an effective parameterization for sweep surfaces, utilizing superellipses for profile representation and B-spline curves for the axis. This compact representation, requiring as few as 14 float numbers, facilitates intuitive and interactive editing while preserving shape details effectively. Additionally, by introducing a differentiable neural sweeper and an encoder-decoder architecture, we demonstrate the ability to predict sweep surface representations without supervision. We show the superiority of our model through several quantitative and qualitative experiments throughout the paper. Our code is available at https://mingrui-zhao.github.io/SweepNet/

Submitted 8 July, 2024; originally announced July 2024.

Comments: 14 pages, 20 figures, ECCV 2024
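
The superellipse profile mentioned in the SweepNet abstract can be illustrated with the standard superellipse curve |x/a|^n + |y/b|^n = 1. The sketch below samples that curve in plain Python. The function name, the parameters `a`, `b`, `n`, and the sampling scheme are assumptions for illustration; the abstract does not give the paper's exact parameterization, and the B-spline axis is not modeled here.

```python
import math


def superellipse_points(a, b, n, count=64):
    """Sample points on the superellipse |x/a|^n + |y/b|^n = 1.

    a, b  : semi-axes of the profile (hypothetical parameter names)
    n     : exponent; n = 2 gives an ellipse, larger n a rounded rectangle
    count : number of samples around the curve
    """
    points = []
    for i in range(count):
        t = 2.0 * math.pi * i / count
        c, s = math.cos(t), math.sin(t)
        # Signed power keeps each point in the correct quadrant.
        x = a * math.copysign(abs(c) ** (2.0 / n), c)
        y = b * math.copysign(abs(s) ** (2.0 / n), s)
        points.append((x, y))
    return points
```

Raising `n` squares off the profile while `a` and `b` stretch it, which gives a sense of how a handful of floats can describe a cross-section.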

arXiv:2405.15305 [pdf, other]

Diff3DS: Generating View-Consistent 3D Sketch via Differentiable Curve Rendering

Authors: Yibo Zhang, Lihong Wang, Changqing Zou, Tieru Wu, Rui Ma

Abstract: 3D sketches are widely used for visually representing the 3D shape and structure of objects or scenes. However, the creation of 3D sketches often requires users to possess professional artistic skills. Existing research efforts primarily focus on enhancing the ability of interactive sketch generation in 3D virtual systems. In this work, we propose Diff3DS, a novel differentiable rendering framework for generating view-consistent 3D sketches by optimizing 3D parametric curves under various supervisions. Specifically, we perform perspective projection to render the 3D rational Bézier curves into 2D curves, which are subsequently converted to a 2D raster image via our customized differentiable rasterizer. Our framework bridges the domains of 3D sketch and raster image, achieving end-to-end optimization of 3D sketches through gradients computed in the 2D image domain. Our Diff3DS can enable a series of novel 3D sketch generation tasks, including text-to-3D sketch and image-to-3D sketch, supported by the popular distillation-based supervision, such as Score Distillation Sampling (SDS). Extensive experiments have yielded promising results and demonstrated the potential of our framework.

Submitted 24 May, 2024; originally announced May 2024.

Comments: Project: https://yiboz2001.github.io/Diff3DS/

arXiv:2405.00700 [pdf]

Oxygen vacancies modulated VO2 for neurons and Spiking Neural Network construction

Authors: Liang Li, Ting Zhou, Tong Liu, Zhiwei Liu, Yaping Li, Shuo Wu, Shanguang Zhao, Jinglin Zhu, Meiling Liu, Zhihan Lin, Bowen Sun, Jianjun Li, Fangwen Sun, Chongwen Zou

Abstract: Artificial neuronal devices are the basic building blocks for neuromorphic computing systems, which have been motivated by realistic brain emulation. Aiming for these applications, various device concepts have been proposed to mimic the neuronal dynamics and functions. Until now, however, artificial neuron devices with high efficiency, high stability and low power consumption have remained far from practical application. Due to its special insulator-metal phase transition, Vanadium Dioxide (VO2) has been considered an ideal candidate for neuronal device fabrication. However, its intrinsic insulating state requires the VO2 neuronal device to be driven under large bias voltage, resulting in high power consumption and low frequency. Thus in the current study, we have addressed this challenge by preparing oxygen vacancies modulated VO2 film (VO2-x) and fabricating the VO2-x neuronal devices for Spiking Neural Networks (SNNs) construction. Results indicate the neuron devices can be operated under lower voltage with improved processing speed. The proposed VO2-x based back-propagation SNNs (BP-SNNs) system, trained with the MNIST dataset, demonstrates excellent accuracy in image recognition. Our study not only demonstrates the VO2-x based neurons and SNN system for practical application, but also offers an effective way to optimize future neuromorphic computing systems through a defect engineering strategy.

Submitted 16 April, 2024; originally announced May 2024.

Comments: 18 pages, 4 figures

arXiv:2404.16452 [pdf, other]

PAD: Patch-Agnostic Defense against Adversarial Patch Attacks

Authors: Lihua Jing, Rui Wang, Wenqi Ren, Xin Dong, Cong Zou

Abstract: Adversarial patch attacks present a significant threat to real-world object detectors due to their practical feasibility. Existing defense methods, which rely on attack data or prior knowledge, struggle to effectively address a wide range of adversarial patches. In this paper, we show two inherent characteristics of adversarial patches, semantic independence and spatial heterogeneity, independent of their appearance, shape, size, quantity, and location. Semantic independence indicates that adversarial patches operate autonomously within their semantic context, while spatial heterogeneity manifests as distinct image quality of the patch area that differs from the original clean image due to the independent generation process. Based on these observations, we propose PAD, a novel adversarial patch localization and removal method that does not require prior knowledge or additional training. PAD offers patch-agnostic defense against various adversarial patches, compatible with any pre-trained object detectors. Our comprehensive digital and physical experiments involving diverse patch types, such as localized noise, printable, and naturalistic patches, exhibit notable improvements over state-of-the-art works. Our code is available at https://github.com/Lihua-Jing/PAD.

Submitted 25 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024

arXiv:2404.11265 [pdf, other]

doi 10.1109/ICCV51070.2023.00021

The Victim and The Beneficiary: Exploiting a Poisoned Model to Train a Clean Model on Poisoned Data

Authors: Zixuan Zhu, Rui Wang, Cong Zou, Lihua Jing

Abstract: Recently, backdoor attacks have posed a serious security threat to the training process of deep neural networks (DNNs). The attacked model behaves normally on benign samples but outputs a specific result when the trigger is present. However, compared with the rocketing progress of backdoor attacks, existing defenses struggle to deal with these threats effectively or require benign samples to work, which may be unavailable in real scenarios. In this paper, we find that the poisoned samples and benign samples can be distinguished with prediction entropy. This inspires us to propose a novel dual-network training framework: The Victim and The Beneficiary (V&B), which exploits a poisoned model to train a clean model without extra benign samples. Firstly, we sacrifice the Victim network to be a powerful poisoned sample detector by training on suspicious samples. Secondly, we train the Beneficiary network on the credible samples selected by the Victim to inhibit backdoor injection. Thirdly, a semi-supervised suppression strategy is adopted for erasing potential backdoors and improving model performance. Furthermore, to better inhibit missed poisoned samples, we propose a strong data augmentation method, AttentionMix, which works well with our proposed V&B framework. Extensive experiments on two widely used datasets against 6 state-of-the-art attacks demonstrate that our framework is effective in preventing backdoor injection and robust to various attacks while maintaining the performance on benign samples. Our code is available at https://github.com/Zixuan-Zhu/VaB.

Submitted 31 May, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

Comments: 13 pages, 6 figures, published to ICCV

Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2023: 155-164
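
The V&B abstract above states that poisoned and benign samples can be distinguished by prediction entropy. The sketch below shows only how such an entropy score could be computed from classifier logits and used to partition a batch. The function names and the `threshold` value are hypothetical, and the abstract does not say which side of the threshold is treated as suspicious, so the code makes no claim about that.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_by_entropy(batch_logits, threshold):
    """Partition sample indices by entropy: (below threshold, at or above).

    Which group counts as 'suspicious' is an assumption left to the caller.
    """
    low, high = [], []
    for i, logits in enumerate(batch_logits):
        h = prediction_entropy(softmax(logits))
        (low if h < threshold else high).append(i)
    return low, high
```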

arXiv:2404.09499 [pdf, other]

Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Authors: Shuaiying Hou, Hongyu Tao, Junheng Fang, Changqing Zou, Hujun Bao, Weiwei Xu

Abstract: Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics. Many previous methods grapple with this inherently ambiguous task by introducing motion priors into the learning process. However, these approaches face difficulties in defining the complete configurations of such priors or training a robust model. In this paper, we present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment between 3D human motion and 2D inputs, namely videos and 2D keypoints. To reduce the complexity of modeling motion priors, we model the motion data separately for the upper and lower body parts. Additionally, we align the motion data with a scale-invariant virtual skeleton to mitigate the interference of human skeleton variations to the motion priors. Evaluated on AIST++, the VTM showcases state-of-the-art performance in reconstructing 3D human motion from monocular videos. Notably, our VTM exhibits the capabilities for generalization to unseen view angles and in-the-wild videos.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.08252 [pdf, other]

MonoPatchNeRF: Improving Neural Radiance Fields with Patch-based Monocular Guidance

Authors: Yuqun Wu, Jae Yong Lee, Chuhang Zou, Shenlong Wang, Derek Hoiem

Abstract: The latest regularized Neural Radiance Field (NeRF) approaches produce poor geometry and view extrapolation for multiview stereo (MVS) benchmarks such as ETH3D. In this paper, we aim to create 3D models that provide accurate geometry and view synthesis, partially closing the large geometric performance gap between NeRF and traditional MVS methods. We propose a patch-based approach that effectively leverages monocular surface normal and relative depth predictions. The patch-based ray sampling also enables the appearance regularization of normalized cross-correlation (NCC) and structural similarity (SSIM) between randomly sampled virtual and training views. We further show that "density restrictions" based on sparse structure-from-motion points can help greatly improve geometric accuracy with a slight drop in novel view synthesis metrics. Our experiments show 4x the performance of RegNeRF and 8x that of FreeNeRF on average F1@2cm for the ETH3D MVS benchmark, suggesting a fruitful research direction to improve the geometric accuracy of NeRF-based models, and shedding light on a potential future approach to enable NeRF-based optimization to eventually outperform traditional MVS.

Submitted 12 April, 2024; originally announced April 2024.

Comments: 26 pages, 15 figures

arXiv:2403.11077 [pdf, other]

Zippo: Zipping Color and Transparency Distributions into a Single Diffusion Model

Authors: Kangyang Xie, Binbin Yang, Hao Chen, Meng Wang, Cheng Zou, Hui Xue, Ming Yang, Chunhua Shen

Abstract: Beyond the superiority of the text-to-image diffusion model in generating high-quality images, recent studies have attempted to uncover its potential for adapting the learned semantic knowledge to visual perception tasks. In this work, instead of translating a generative diffusion model into a visual perception model, we explore retaining the generative ability alongside the perceptive adaptation. To accomplish this, we present Zippo, a unified framework for zipping the color and transparency distributions into a single diffusion model by expanding the diffusion latent into a joint representation of RGB images and alpha mattes. By alternately selecting one modality as the condition and then applying the diffusion process to the counterpart modality, Zippo is capable of generating RGB images from alpha mattes and predicting transparency from input images. In addition to single-modality prediction, we propose a modality-aware noise reassignment strategy to further empower Zippo to jointly generate RGB images and their corresponding alpha mattes under text guidance. Our experiments showcase Zippo's ability to perform efficient text-conditioned transparent image generation and present plausible results of Matte-to-RGB and RGB-to-Matte translation.

Submitted 19 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.09439 [pdf, other]

3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation

Authors: Frank Zhang, Yibo Zhang, Quan Zheng, Rui Ma, Wei Hua, Hujun Bao, Weiwei Xu, Changqing Zou

Abstract: Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevents the models from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior from a 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.

Submitted 14 March, 2024; originally announced March 2024.

Comments: 11 pages, 7 figures

arXiv:2403.07728 [pdf, other]

CAP: A General Algorithm for Online Selective Conformal Prediction with FCR Control

Authors: Yajie Bao, Yuyang Huo, Haojie Ren, Changliang Zou

Abstract: We study the problem of post-selection predictive inference in an online fashion. To avoid devoting resources to unimportant units, a preliminary selection of the current individual before reporting its prediction interval is common and meaningful in online predictive tasks. Since the online selection causes a temporal multiplicity in the selected prediction intervals, it is important to control the real-time false coverage-statement rate (FCR), which measures the overall miscoverage level. We develop a general framework named CAP (Calibration after Adaptive Pick) that performs an adaptive pick rule on historical data to construct a calibration set if the current individual is selected and then outputs a conformal prediction interval for the unobserved label. We provide tractable procedures for constructing the calibration set for popular online selection rules. We prove that CAP can achieve an exact selection-conditional coverage guarantee in the finite-sample and distribution-free regimes. To account for the distribution shift in online data, we also embed CAP into some recent dynamic conformal prediction algorithms and show that the proposed method can deliver long-run FCR control. Numerical results on both synthetic and real data corroborate that CAP can effectively control FCR around the target level and yield narrower prediction intervals than existing baselines across various settings.

Submitted 28 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.
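
The CAP abstract above ends its pipeline with a conformal prediction interval built from a calibration set. The sketch below shows only that generic final step, using the standard split-conformal quantile on absolute residuals. It is not CAP's adaptive pick rule: how the calibration set is chosen after selection is the paper's contribution and is not reproduced here. The function name and arguments are assumptions.

```python
import math


def conformal_interval(calib_residuals, prediction, alpha=0.1):
    """Split-conformal interval from absolute calibration residuals.

    calib_residuals : |y_i - y_hat_i| on the calibration set (assumed given)
    prediction      : point prediction for the current individual
    alpha           : target miscoverage level
    """
    n = len(calib_residuals)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: interval is unbounded.
        return (-math.inf, math.inf)
    q = sorted(calib_residuals)[k - 1]
    return (prediction - q, prediction + q)
```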

arXiv:2311.12818 [pdf, other]

Manifold Path Guiding for Importance Sampling Specular Chains

Authors: Zhimin Fan, Pengpei Hong, Jie Guo, Changqing Zou, Yanwen Guo, Ling-Qi Yan

Abstract: Complex visual effects such as caustics are often produced by light paths containing multiple consecutive specular vertices (dubbed specular chains), which pose a challenge to unbiased estimation in Monte Carlo rendering. In this work, we study the light transport behavior within a sub-path that is comprised of a specular chain and two non-specular separators. We show that the specular manifolds formed by all the sub-paths could be exploited to provide coherence among sub-paths. By reconstructing continuous energy distributions from historical and coherent sub-paths, seed chains can be generated in the context of importance sampling and converge to admissible chains through manifold walks. We verify that importance sampling the seed chain in the continuous space reaches the goal of importance sampling the discrete admissible specular chain. Based on these observations and theoretical analyses, a progressive pipeline, manifold path guiding, is designed and implemented to importance sample challenging paths featuring long specular chains. To the best of our knowledge, this is the first general framework for importance sampling discrete specular chains in regular Monte Carlo rendering. Extensive experiments demonstrate that our method outperforms state-of-the-art unbiased solutions with up to 40x variance reduction, especially in typical scenes containing long specular chains and complex visibility.

Submitted 24 September, 2023; originally announced November 2023.

Comments: 14 pages, 19 figures

ACM Class: I.3.6

arXiv:2309.05941 [pdf]

Random Segmentation: New Traffic Obfuscation against Packet-Size-Based Side-Channel Attacks

Authors: Mnassar Alyami, Abdulmajeed Alghamdi, Mohammed Alkhowaiter, Cliff Zou, Yan Solihin

Abstract: Despite encryption, the packet size is still visible, enabling observers to infer private information in the Internet of Things (IoT) environment (e.g., IoT device identification). Packet padding obfuscates packet-length characteristics with a high data overhead because it relies on adding noise to the data. This paper proposes a more data-efficient approach that randomizes packet sizes without adding noise. We achieve this by splitting large TCP segments into random-sized chunks; hence, the packet length distribution is obfuscated without adding noise data. Our client-server implementation using TCP sockets demonstrates the feasibility of our approach at the application level. We realize our packet size control by adjusting two local socket-programming parameters. First, we enable the TCP_NODELAY option to send out each packet with our specified length. Second, we downsize the sending buffer to prevent the sender from pushing out more data than can be received, which could disable our control of the packet sizes. We simulate our defense on a network trace of four IoT devices and show a reduction in device classification accuracy from 98% to 63%, close to random guessing. Meanwhile, the real-world data transmission experiments show that the added latency is reasonable, less than 21%, while the added packet header overhead is only about 5%.

Submitted 11 September, 2023; originally announced September 2023.

Comments: 15 pages, 4 figures, to appear in Sensors 2023
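
The Random Segmentation abstract above describes splitting data into random-sized chunks and adjusting two socket parameters (TCP_NODELAY and the send buffer size). The sketch below is one way that could look with Python's standard `socket` module. The function names, the uniform size distribution, and the size bounds are assumptions; the abstract does not state the authors' chunk-size distribution or buffer size.

```python
import random
import socket


def random_chunks(payload, min_size, max_size, rng=random):
    """Split payload into consecutive chunks of random length.

    Sizes are drawn uniformly from [min_size, max_size] (an assumed
    distribution); the final chunk may be shorter than min_size.
    """
    if not 1 <= min_size <= max_size:
        raise ValueError("require 1 <= min_size <= max_size")
    chunks = []
    offset = 0
    while offset < len(payload):
        size = rng.randint(min_size, max_size)
        chunks.append(payload[offset:offset + size])
        offset += size
    return chunks


def configure_socket(sock, sndbuf_bytes):
    """Apply the two socket settings named in the abstract."""
    # Disable Nagle's algorithm so small writes are not coalesced.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Shrink the send buffer; the OS may round or clamp this value.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf_bytes)


def send_segmented(sock, payload, min_size, max_size):
    """Send payload as a sequence of random-sized writes."""
    for chunk in random_chunks(payload, min_size, max_size):
        sock.sendall(chunk)
```

Note that one `sendall` call per chunk does not by itself guarantee one packet per chunk on the wire; that depends on the TCP stack, which is why the abstract pairs the chunking with the two socket settings.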

arXiv:2308.15902 [pdf]

Photonic time-delayed reservoir computing based on series coupled microring resonators with high memory capacity

Authors: Yijia Li, Ming Li, MingYi Gao, Chang-Ling Zou, Chun-Hua Dong, Jin Lu, Yali Qin, XiaoNiu Yang, Qi Xuan, Hongliang Ren

Abstract: On-chip microring resonators (MRRs) have been proposed to construct the time-delayed reservoir computing (RC), which offers promising configurations available for computation with high scalability, high-density computing, and easy fabrication. A single MRR, however, is inadequate to supply enough memory for the computational task with diverse memory requirements. Large memory needs are met by the MRR with optical feedback waveguide, but at the expense of its large footprint. In the structure, the ultra-long optical feedback waveguide substantially limits the scalable photonic RC integrated designs. In this paper, a time-delayed RC is proposed by utilizing a silicon-based nonlinear MRR in conjunction with an array of linear MRRs. These linear MRRs possess a high quality factor, providing sufficient memory capacity for the entire system. We quantitatively analyze and assess the proposed RC structure's performance on three classical tasks with diverse memory requirements, i.e., the Narma 10, Mackey-Glass, and Santa Fe chaotic time-series prediction tasks. The proposed system exhibits comparable performance to the MRR with an ultra-long optical feedback waveguide-based system when it comes to handling the Narma 10 task, which requires a significant memory capacity. Nevertheless, the overall length of these linear MRRs is significantly smaller, by three orders of magnitude, compared to the ultra-long feedback waveguide in the MRR with optical feedback waveguide-based system. The compactness of this structure has significant implications for the scalability and seamless integration of photonic RC.

Submitted 30 August, 2023; originally announced August 2023.

arXiv:2308.13176 [pdf, other]

Using Adamic-Adar Index Algorithm to Predict Volunteer Collaboration: Less is More

Authors: Chao Wu, Peng Chen, Baiqiao Yin, Zijuan Lin, Chen Jiang, Di Yu, Changhong Zou, Chunwang Lui

Abstract: Social networks exhibit a complex graph-like structure due to the uncertainty surrounding potential collaborations among participants. Machine learning algorithms generally perform well across many real-world prediction tasks. However, whether machine learning algorithms outperform specific algorithms designed for graph link prediction remains unknown to us. To address this issue, the Adamic-Adar Index (AAI), Jaccard Coefficient (JC) and common neighbour centrality (CNC) as representatives of graph-specific algorithms were applied to predict potential collaborations, utilizing data from volunteer activities during the Covid-19 pandemic in Shenzhen city, along with the classical machine learning algorithms such as random forest, support vector machine, and gradient boosting as single predictors and components of ensemble learning. This paper shows that the AAI algorithm outperformed the traditional JC and CNC, and other machine learning algorithms in analyzing graph node attributes for this task.

Submitted 25 August, 2023; originally announced August 2023.
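
For readers unfamiliar with the two link-prediction scores named in the abstract above, the sketch below computes the Adamic-Adar index (sum of 1/log(degree) over common neighbours) and the Jaccard coefficient on an undirected graph stored as a dict of neighbour sets. The graph representation and function names are assumptions for illustration; this is not the authors' code or data.

```python
import math


def adamic_adar(adj, u, v):
    """Adamic-Adar index: sum of 1 / log(deg(z)) over common neighbours z."""
    common = adj[u] & adj[v]
    # A common neighbour of two distinct nodes has degree >= 2, so log > 0;
    # the guard only protects against malformed input.
    return sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)


def jaccard(adj, u, v):
    """Jaccard coefficient: |N(u) & N(v)| / |N(u) | N(v)|."""
    union = adj[u] | adj[v]
    if not union:
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj
```

A higher score for a currently unlinked pair is read as a higher chance of a future link; Adamic-Adar down-weights common neighbours that are connected to many others.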

arXiv:2308.04669 [pdf, other]

A General Implicit Framework for Fast NeRF Composition and Rendering

Authors: Xinyu Gao, Ziyi Yang, Yunlu Zhao, Yuxiang Sun, Xiaogang Jin, Changqing Zou

Abstract: A variety of Neural Radiance Fields (NeRF) methods have recently achieved remarkable success in high render speed. However, current accelerating methods are specialized and incompatible with various implicit methods, preventing real-time composition over various types of NeRF works. Because NeRF relies on sampling along rays, it is possible to provide general guidance for acceleration. To that end, we propose a general implicit pipeline for composing NeRF objects quickly. Our method enables the casting of dynamic shadows within or between objects using analytical light sources while allowing multiple NeRF objects to be seamlessly placed and rendered together with any arbitrary rigid transformations. Mainly, our work introduces a new surface representation known as Neural Depth Fields (NeDF) that quickly determines the spatial relationship between objects by allowing direct intersection computation between rays and implicit surfaces. It leverages an intersection neural network to query NeRF for acceleration instead of depending on an explicit spatial structure. Our proposed method is the first to enable both the progressive and interactive composition of NeRF objects. It also serves as a previewing plugin for a range of existing NeRF works.

Submitted 4 January, 2024; v1 submitted 8 August, 2023; originally announced August 2023.

Comments: AAAI 2024

arXiv:2306.00314 [pdf]

Adversarial-Aware Deep Learning System based on a Secondary Classical Machine Learning Verification Approach

Authors: Mohammed Alkhowaiter, Hisham Kholidy, Mnassar Alyami, Abdulmajeed Alghamdi, Cliff Zou

Abstract: Deep learning models have been used in creating various effective image classification applications. However, they are vulnerable to adversarial attacks that seek to misguide the models into predicting incorrect classes. Our study of major adversarial attack models shows that they all specifically target and exploit the neural networking structures in their designs. This understanding leads us to hypothesize that most classical machine learning models, such as Random Forest (RF), are immune to adversarial attack models because they do not rely on neural network design at all. Our experimental study of classical machine learning models against popular adversarial attacks supports this hypothesis. Based on this hypothesis, we propose a new adversarial-aware deep learning system by using a classical machine learning model as the secondary verification system to complement the primary deep learning model in image classification. Although the secondary classical machine learning model has less accurate output, it is only used for verification purposes, which does not impact the output accuracy of the primary deep learning model, and at the same time, can effectively detect an adversarial attack when a clear mismatch occurs. Our experiments based on the CIFAR-100 dataset show that our proposed approach outperforms current state-of-the-art adversarial defense systems.

Submitted 31 May, 2023; originally announced June 2023.

Comments: 17 pages, 3 figures
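
The verification idea described in the abstract above (a classical model double-checking a deep model) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `flag_adversarial`, the use of a confidence score from the secondary model, and the `threshold` value are all assumptions chosen to give "clear mismatch" a concrete meaning.

```python
def flag_adversarial(primary_label: int,
                     secondary_label: int,
                     secondary_confidence: float,
                     threshold: float = 0.8) -> bool:
    """Return True when the input should be flagged as possibly adversarial.

    primary_label: class predicted by the deep learning model.
    secondary_label: class predicted by the classical verifier
        (for example a random forest).
    secondary_confidence: the verifier's confidence in its own label, in [0, 1].
    threshold: hypothetical cut-off for what counts as a "clear" mismatch.
    """
    labels_disagree = primary_label != secondary_label
    verifier_is_sure = secondary_confidence >= threshold
    # The primary prediction is never overwritten; the verifier only raises a flag.
    return labels_disagree and verifier_is_sure
```

In this sketch the primary model's output is always the one returned to the user, so the verifier's lower accuracy cannot reduce the primary model's accuracy; it can only raise an alert.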

arXiv:2306.00095 [pdf]

doi 10.1109/GLOBECOM48099.2022.10001537

Side-Channel VoIP Profiling Attack against Customer Service Automated Phone System

Authors: Roy Laurens, Edo Christianto, Bruce Caulkins, Cliff C. Zou

Abstract: In many VoIP systems, Voice Activity Detection (VAD) is often used on VoIP traffic to suppress packets of silence in order to reduce the bandwidth consumption of phone calls. Unfortunately, although VoIP traffic is fully encrypted and secured, traffic analysis of this suppression can reveal identifying information about calls made to customer service automated phone systems. Because different customer service phone systems have distinct, but fixed (pre-recorded) automated voice messages sent to customers, VAD silence suppression used in VoIP will enable an eavesdropper to profile and identify these automated voice messages. In this paper, we use a popular enterprise VoIP system (Cisco CallManager), running the default Session Initiation Protocol (SIP), to demonstrate that an attacker can reliably use the silence suppression to profile calls to such VoIP systems. Our real-world experiments demonstrate that this side-channel profiling attack can be used to accurately identify not only which customer service phone number a customer calls, but also which options the caller subsequently chooses during the phone conversation.

Submitted 31 May, 2023; originally announced June 2023.

Comments: 6 pages, 12 figures. Published in IEEE Global Communications Conference (GLOBECOM), 2022

Journal ref: 2022 IEEE Global Communications Conference, Rio de Janeiro, Brazil, 2022, pp. 6091-6096

arXiv:2305.04685 [pdf, other]

ARDIE: AR, Dialogue, and Eye Gaze Policies for Human-Robot Collaboration

Authors: Chelsea Zou, Kishan Chandan, Yan Ding, Shiqi Zhang

Abstract: Human-robot collaboration (HRC) has become increasingly relevant in industrial, household, and commercial settings. However, the effectiveness of such collaborations is highly dependent on the human and robots' situational awareness of the environment. Improving this awareness includes not only aligning perceptions in a shared workspace, but also bidirectionally communicating intent and visualizing different states of the environment to enhance scene understanding. In this paper, we propose ARDIE (Augmented Reality with Dialogue and Eye Gaze), a novel intelligent agent that leverages multi-modal feedback cues to enhance HRC. Our system utilizes a decision theoretic framework to formulate a joint policy that incorporates interactive augmented reality (AR), natural language, and eye gaze to portray current and future states of the environment. Through object-specific AR renders, the human can visualize future object interactions to make adjustments as needed, ultimately providing an interactive and efficient collaboration between humans and robots.

Submitted 8 May, 2023; originally announced May 2023.

arXiv:2304.14422 [pdf, other]

MINN: Learning the dynamics of differential-algebraic equations and application to battery modeling

Authors: Yicun Huang, Changfu Zou, Yang Li, Torsten Wik

Abstract: The concept of integrating physics-based and data-driven approaches has become popular for modeling sustainable energy systems. However, the existing literature mainly focuses on the data-driven surrogates generated to replace physics-based models. These models often trade accuracy for speed but lack the generalisability, adaptability, and interpretability inherent in physics-based models, which are often indispensable in the modeling of real-world dynamic systems for optimization and control purposes. In this work, we propose a novel architecture for generating model-integrated neural networks (MINN) to allow integration on the level of learning physics-based dynamics of the system. The obtained hybrid model solves an unsettled research problem in control-oriented modeling, i.e., how to obtain an optimally simplified model that is physically insightful, numerically accurate, and computationally tractable simultaneously. We apply the proposed neural network architecture to model the electrochemical dynamics of lithium-ion batteries and show that MINN is extremely data-efficient to train while being sufficiently generalizable to previously unseen input data, owing to its underlying physical invariants. The MINN battery model has an accuracy comparable to the first principle-based model in predicting both the system outputs and any locally distributed electrochemical behaviors but achieves two orders of magnitude reduction in the solution time.

Submitted 27 April, 2023; originally announced April 2023.

arXiv:2303.17867 [pdf, other]

CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer

Authors: Linfeng Wen, Chengying Gao, Changqing Zou

Abstract: Content affinity loss, including feature and pixel affinity, is a main problem that leads to artifacts in photorealistic and video style transfer. This paper proposes a new framework named CAP-VSTNet, which consists of a new reversible residual network and an unbiased linear transform module, for versatile style transfer. This reversible residual network not only preserves content affinity but also avoids introducing redundant information as traditional reversible networks do, and hence facilitates better stylization. Empowered by a Matting Laplacian training loss, which can address the pixel affinity loss problem caused by the linear transform, the proposed framework is applicable and effective for versatile style transfer. Extensive experiments show that CAP-VSTNet can produce better qualitative and quantitative results in comparison with the state-of-the-art methods.

Submitted 31 March, 2023; originally announced March 2023.

Comments: CVPR 2023

arXiv:2303.10839 [pdf, other]

MXM-CLR: A Unified Framework for Contrastive Learning of Multifold Cross-Modal Representations

Authors: Ye Wang, Bowei Jiang, Changqing Zou, Rui Ma

Abstract: Multifold observations are common for different data modalities, e.g., a 3D shape can be represented by multi-view images and an image can be described with different captions. Existing cross-modal contrastive representation learning (XM-CLR) methods such as CLIP are not fully suitable for multifold data as they only consider one positive pair and treat other pairs as negative when computing the contrastive loss. In this paper, we propose MXM-CLR, a unified framework for contrastive learning of multifold cross-modal representations. MXM-CLR explicitly models and learns the relationships between multifold observations of instances from different modalities for more comprehensive representation learning. The key to MXM-CLR is a novel multifold-aware hybrid loss which considers multiple positive observations when computing the hard and soft relationships for the cross-modal data pairs. We conduct quantitative and qualitative comparisons with SOTA baselines for cross-modal retrieval tasks on the Text2Shape and Flickr30K datasets. We also perform extensive evaluations on the adaptability and generalizability of MXM-CLR, as well as ablation studies on the loss design and effects of batch sizes. The results show the superiority of MXM-CLR in learning better representations for the multifold data. The code is available at https://github.com/JLU-ICL/MXM-CLR.

Submitted 20 March, 2023; v1 submitted 19 March, 2023; originally announced March 2023.

Comments: 16 pages, 14 figures

arXiv:2302.14335 [pdf, other]

DC-Former: Diverse and Compact Transformer for Person Re-Identification

Authors: Wen Li, Cheng Zou, Meng Wang, Furong Xu, Jianan Zhao, Ruobing Zheng, Yuan Cheng, Wei Chu

Abstract: In the person re-identification (re-ID) task, it is still challenging to learn discriminative representations by deep learning, due to limited data. Generally speaking, the model will get better performance when increasing the amount of data. The addition of similar classes strengthens the ability of the classifier to identify similar identities, thereby improving the discrimination of representation. In this paper, we propose a Diverse and Compact Transformer (DC-Former) that can achieve a similar effect by splitting embedding space into multiple diverse and compact subspaces. A compact embedding subspace helps the model learn more robust and discriminative embeddings to identify similar classes. The fusion of these diverse embeddings, which contain more fine-grained information, can further improve the effect of re-ID. Specifically, multiple class tokens are used in a vision transformer to represent multiple embedding spaces. Then, a self-diverse constraint (SDC) is applied to these spaces to push them away from each other, which makes each embedding space diverse and compact. Further, a dynamic weight controller (DWC) is designed for balancing the relative importance among them during training. The experimental results of our method are promising, surpassing previous state-of-the-art methods on several commonly used person re-ID benchmarks.

Submitted 28 February, 2023; originally announced February 2023.

Comments: Accepted by AAAI23

arXiv:2212.14670 [pdf, other]

Hierarchical Deep Reinforcement Learning for VWAP Strategy Optimization

Authors: Xiaodong Li, Pangjing Wu, Chenxin Zou, Qing Li

Abstract: Designing an intelligent volume-weighted average price (VWAP) strategy is a critical concern for brokers, since traditional rule-based strategies are relatively static and cannot achieve a lower transaction cost in a dynamic market. Many studies have tried to minimize the cost via reinforcement learning, but there are bottlenecks in improvement, especially for long-duration strategies such as the VWAP strategy. To address this issue, we propose a joint deep learning and hierarchical reinforcement learning architecture termed Macro-Meta-Micro Trader (M3T) to capture market patterns and execute orders from different temporal scales. The Macro Trader first allocates a parent order into tranches based on volume profiles as the traditional VWAP strategy does, but a long short-term memory neural network is used to improve the forecasting accuracy. Then the Meta Trader selects a short-term subgoal appropriate to instant liquidity within each tranche to form a mini-tranche. The Micro Trader consequently extracts the instant market state and fulfils the subgoal with the lowest transaction cost. Our experiments over stocks listed on the Shanghai stock exchange demonstrate that our approach outperforms baselines in terms of VWAP slippage, with an average cost saving of 1.16 basis points compared to the optimal baseline.

Submitted 11 December, 2022; originally announced December 2022.
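
For readers unfamiliar with the metric reported in the abstract above, the following minimal sketch shows how a volume-weighted average price and a slippage figure in basis points are commonly computed. The function names and the sign convention (positive means the execution was more expensive than the market VWAP for a buy order) are assumptions for illustration; the paper's exact definition of VWAP slippage may differ.

```python
from typing import Sequence


def vwap(prices: Sequence[float], volumes: Sequence[float]) -> float:
    """Volume-weighted average price: sum(price * volume) / sum(volume)."""
    total_volume = sum(volumes)
    if total_volume == 0:
        raise ValueError("total volume must be positive")
    return sum(p * v for p, v in zip(prices, volumes)) / total_volume


def slippage_bps(exec_prices: Sequence[float], exec_volumes: Sequence[float],
                 market_prices: Sequence[float], market_volumes: Sequence[float]) -> float:
    """Relative gap between execution VWAP and market VWAP, in basis points.

    One basis point is 0.01%, hence the factor of 10,000.
    """
    market = vwap(market_prices, market_volumes)
    executed = vwap(exec_prices, exec_volumes)
    return (executed - market) / market * 10_000
```

For example, `vwap([10, 20], [3, 1])` weights the price 10 three times as heavily as the price 20 and gives 12.5.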

arXiv:2212.00994 [pdf, ps, other]

Knowledge Graph Quality Evaluation under Incomplete Information

Authors: Xiaodong Li, Chenxin Zou, Yi Cai, Yuelong Zhu

Abstract: Knowledge graphs (KGs) have attracted more and more attention because of their fundamental roles in many tasks. Quality evaluation for KGs is thus crucial and indispensable. Existing methods in this field evaluate KGs by either proposing new quality metrics from different dimensions or measuring performance at KG construction stages. However, there are two major issues with those methods. First, they rely heavily on raw data in KGs, which exposes the KGs' internal information during quality evaluation. Second, they focus more on quality at the data level than at the ability level, although the latter is more important for downstream applications. To address these issues, we propose a knowledge graph quality evaluation framework under incomplete information (QEII). The quality evaluation task is transformed into an adversarial Q&A game between two KGs. The winner of the game is considered to have better quality. During the evaluation process, no raw data is exposed, which ensures information protection. Experimental results on four pairs of KGs demonstrate that, compared with baselines, QEII achieves a reasonable quality evaluation at the ability level under incomplete information.

Submitted 12 April, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

arXiv:2212.00914 [pdf, other]

QFF: Quantized Fourier Features for Neural Field Representations

Authors: Jae Yong Lee, Yuqun Wu, Chuhang Zou, Shenlong Wang, Derek Hoiem

Abstract: Multilayer perceptrons (MLPs) learn high frequencies slowly. Recent approaches encode features in spatial bins to improve speed of learning details, but at the cost of larger model size and loss of continuity. Instead, we propose to encode features in bins of Fourier features that are commonly used for positional encoding. We call these Quantized Fourier Features (QFF). As a naturally multiresolution and periodic representation, our experiments show that using QFF can result in smaller model size, faster training, and better quality outputs for several applications, including Neural Image Representations (NIR), Neural Radiance Field (NeRF) and Signed Distance Function (SDF) modeling. QFF are easy to code, fast to compute, and serve as a simple drop-in addition to many neural field representations.

Submitted 1 December, 2022; originally announced December 2022.

arXiv:2210.07582 [pdf, other]

Deep PatchMatch MVS with Learned Patch Coplanarity, Geometric Consistency and Adaptive Pixel Sampling

Authors: Jae Yong Lee, Chuhang Zou, Derek Hoiem

Abstract: Recent work in multi-view stereo (MVS) combines learnable photometric scores and regularization with PatchMatch-based optimization to achieve robust pixelwise estimates of depth, normals, and visibility. However, non-learning based methods still outperform for large scenes with sparse views, in part due to use of geometric consistency constraints and ability to optimize over many views at high resolution. In this paper, we build on learning-based approaches to improve photometric scores by learning patch coplanarity and encourage geometric consistency by learning a scaled photometric cost that can be combined with reprojection error. We also propose an adaptive pixel sampling strategy for candidate propagation that reduces memory to enable training on larger resolution with more views and a larger encoder. These modifications lead to 6-15% gains in accuracy and completeness on the challenging ETH3D benchmark, resulting in higher F1 performance than the widely used state-of-the-art non-learning approaches ACMM and ACMP.

Submitted 14 October, 2022; originally announced October 2022.

arXiv:2207.01216 [pdf, other]

Solutions for Fine-grained and Long-tailed Snake Species Recognition in SnakeCLEF 2022

Authors: Cheng Zou, Furong Xu, Meng Wang, Wen Li, Yuan Cheng

Abstract: Automatic snake species recognition is important because it has vast potential to help lower deaths and disabilities caused by snakebites. We introduce our solution in SnakeCLEF 2022 for fine-grained snake species recognition on a heavily long-tailed class distribution. First, a network architecture is designed to extract and fuse features from multiple modalities, i.e. photographs from the visual modality and geographic locality information from the language modality. Then, logit adjustment based methods are studied to relieve the impact caused by the severe class imbalance. Next, a combination of supervised and self-supervised learning methods is proposed to make full use of the dataset, including both labeled training data and unlabeled testing data. Finally, post-processing strategies, such as multi-scale and multi-crop test-time augmentation, location filtering and model ensembling, are employed for better performance. With an ensemble of several different models, a private score of 82.65%, ranking 3rd, is achieved on the final leaderboard.

Submitted 4 July, 2022; originally announced July 2022.

Comments: Top solutions for FGVC9, accepted to CLEF2022

arXiv:2206.06741 [pdf, other]

Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis

Authors: Rania Briq, Chuhang Zou, Leonid Pishchulin, Chris Broaddus, Juergen Gall

Abstract: We consider the problem of synthesizing multi-action human motion sequences of arbitrary lengths. Existing approaches have mastered motion sequence generation in single action scenarios, but fail to generalize to multi-action and arbitrary-length sequences. We fill this gap by proposing a novel efficient approach that leverages expressiveness of Recurrent Transformers and generative richness of conditional Variational Autoencoders. The proposed iterative approach is able to generate smooth and realistic human motion sequences with an arbitrary number of actions and frames while doing so in linear space and time. We train and evaluate the proposed approach on PROX and Charades datasets, where we augment PROX with ground-truth action labels and Charades with human mesh annotations. Experimental evaluation shows significant improvements in FID score and semantic consistency metrics compared to the state-of-the-art.

Submitted 27 June, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: accepted at Transformers for Vision workshop at CVPR 2022

arXiv:2205.09335 [pdf, other]

A Simple Yet Effective SVD-GCN for Directed Graphs

Authors: Chunya Zou, Andi Han, Lequan Lin, Junbin Gao

Abstract: In this paper, we propose a simple yet effective graph neural network for directed graphs (digraphs) based on the classic Singular Value Decomposition (SVD), named SVD-GCN. The new graph neural network is built upon the graph SVD-framelet to better decompose graph signals on the SVD "frequency" bands. Further, the new framelet SVD-GCN is also scaled up for larger graphs by using Chebyshev polynomial approximation. Through empirical experiments conducted on several node classification datasets, we have found that SVD-GCN has remarkable improvements in a variety of graph node learning tasks and it outperforms GCN and many other state-of-the-art graph neural networks for digraphs. Moreover, we empirically demonstrate that the SVD-GCN has great denoising capability and robustness to high level graph data attacks. The theoretical and experimental results show that the SVD-GCN is effective on a variety of graph datasets, while maintaining stable and even better performance than the state of the art.

Submitted 19 May, 2022; originally announced May 2022.

Comments: 14 pages

arXiv:2202.06738 [pdf, other]

Attention-based Deep Neural Networks for Battery Discharge Capacity Forecasting

Authors: Yadong Zhang, Chenye Zou, Xin Chen

Abstract: Battery discharge capacity forecasting is essential for the applications of lithium-ion batteries. The capacity degeneration can be treated as the memory of the initial battery state of charge from the data point of view. The streaming sensor data collected by battery management systems (BMS) reflect the usable battery capacity degradation rates under various operational working conditions. The battery capacity in different cycles can be measured with the temporal patterns extracted from the streaming sensor data based on the attention mechanism. The attention-based similarity regarding the first cycle can describe the battery capacity degradation in the following cycles. The deep degradation network (DDN) is developed with the attention mechanism to measure similarity and predict battery capacity. The DDN model can extract the degeneration-related temporal patterns from the streaming sensor data and perform the battery capacity prediction efficiently online in real-time. Based on the MIT-Stanford open-access battery aging dataset, the root-mean-square error of the capacity estimation is 1.3 mAh. The mean absolute percentage error of the proposed DDN model is 0.06%. The DDN model also performs well on the Oxford Battery Degradation Dataset with dynamic load profiles. Therefore, the high accuracy and strong robustness of the proposed algorithm are verified.

Submitted 14 February, 2022; originally announced February 2022.
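
The two error metrics quoted in the abstract above, root-mean-square error (RMSE) and mean absolute percentage error (MAPE), follow standard definitions, sketched below. The function names are illustrative and nothing here is taken from the authors' code; capacities are assumed to be given in consistent units with non-zero true values.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error, in the same unit as the inputs (e.g. mAh)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, returned as a percentage.

    Assumes every value in `actual` is non-zero.
    """
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)
```

RMSE penalizes large individual errors more strongly, while MAPE is scale-free, which is why abstracts of this kind often report both.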

arXiv:2112.07383 [pdf, other]

Improving Human-Object Interaction Detection via Phrase Learning and Label Composition

Authors: Zhimin Li, Cheng Zou, Yu Zhao, Boxun Li, Sheng Zhong

Abstract: Human-Object Interaction (HOI) detection is a fundamental task in high-level human-centric scene understanding. We propose PhraseHOI, containing a HOI branch and a novel phrase branch, to leverage language prior and improve relation expression. Specifically, the phrase branch is supervised by semantic embeddings, whose ground truths are automatically converted from the original HOI annotations without extra human efforts. Meanwhile, a novel label composition method is proposed to deal with the long-tailed problem in HOI, which composites novel phrase labels by semantic neighbors. Further, to optimize the phrase branch, a loss composed of a distilling loss and a balanced triplet loss is proposed. Extensive experiments are conducted to prove the effectiveness of the proposed PhraseHOI, which achieves significant improvement over the baseline and surpasses previous state-of-the-art methods on Full and NonRare on the challenging HICO-DET benchmark.

Submitted 15 January, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

Comments: Accepted to AAAI2022

arXiv:2112.04761 [pdf, other]

HBReID: Harder Batch for Re-identification

Authors: Wen Li, Furong Xu, Jianan Zhao, Ruobing Zheng, Cheng Zou, Meng Wang, Yuan Cheng

Abstract: Triplet loss is a widely adopted loss function in the ReID task which pulls the hardest positive pairs close and pushes the hardest negative pairs far away. However, the selected samples are not the hardest globally, but the hardest only in a mini-batch, which will affect the performance. In this report, a hard batch mining method is proposed to mine the hardest samples globally to make triplets harder. More specifically, the most similar classes are selected into the same mini-batch so that the similar classes can be pushed further away. In addition, an adversarial scene removal module composed of a scene classifier and an adversarial loss is used to learn scene-invariant feature representations. Experiments are conducted on the MSMT17 dataset to prove the effectiveness, and our method surpasses all of the previous methods and sets a state-of-the-art result.

Submitted 9 December, 2021; originally announced December 2021.

arXiv:2112.02889 [pdf, other]

doi 10.1007/978-3-031-19809-0_39

Joint Learning of Localized Representations from Medical Images and Reports

Authors: Philip Müller, Georgios Kaissis, Congyu Zou, Daniel Rueckert

Abstract: Contrastive learning has proven effective for pre-training image models on unlabeled data with promising results for tasks such as medical image classification. Using paired text (like radiological reports) during pre-training improves the results even further. Still, most existing methods target image classification downstream tasks and may not be optimal for localized tasks like semantic segmentation or object detection. We therefore propose Localized representation learning from Vision and Text (LoVT), to our best knowledge, the first text-supervised pre-training method that targets localized medical imaging tasks. Our method combines instance-level image-report contrastive learning with local contrastive learning on image region and report sentence representations. We evaluate LoVT and commonly used pre-training methods on an evaluation framework of 18 localized tasks on chest X-rays from five public datasets. LoVT performs best on 10 of the 18 studied tasks making it the preferred method of choice for localized tasks.

Submitted 31 August, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

Comments: Accepted at ECCV 2022

Journal ref: Computer Vision - ECCV 2022, pp. 685-701

arXiv:2109.11913 [pdf]

Spatial Information Refinement for Chroma Intra Prediction in Video Coding

Authors: Chengyi Zou, Shuai Wan, Tiannan Ji, Marta Mrak, Marc Gorriz Blanch, Luis Herranz

Abstract: Video compression benefits from advanced chroma intra prediction methods, such as the Cross-Component Linear Model (CCLM) which uses linear models to approximate the relationship between the luma and chroma components. Recently it has been proven that advanced cross-component prediction methods based on Neural Networks (NN) can bring additional coding gains. In this paper, spatial information refinement is proposed for improving NN-based chroma intra prediction. Specifically, the performance of chroma intra prediction can be improved by refined down-sampling or by incorporating location information. Experimental results show that the two proposed methods obtain 0.31%, 2.64%, 2.02% and 0.33%, 3.00%, 2.12% BD-rate reduction on Y, Cb and Cr components, respectively, under All-Intra configuration, when implemented in Versatile Video Coding (H.266/VVC) test model. Index Terms: chroma intra prediction, convolutional neural networks, spatial information refinement.

Submitted 24 September, 2021; originally announced September 2021.

arXiv:2109.01971 [pdf, other]

Horizontal and Vertical Collaboration for VR Delivery in MEC-Enabled Small-Cell Networks

Authors: Zhuojia Gu, Hancheng Lu, Chenkai Zou

Abstract: Due to the large bandwidth, low latency and computationally intensive features of virtual reality (VR) video applications, the current resource-constrained wireless and edge networks cannot meet the requirements of on-demand VR delivery. In this letter, we propose a joint horizontal and vertical collaboration architecture in mobile edge computing (MEC)-enabled small-cell networks for VR delivery. In the proposed architecture, multiple MEC servers can jointly provide VR head-mounted devices (HMDs) with edge caching and viewpoint computation services, while the computation tasks can also be performed at HMDs or on the cloud. Power allocation at base stations (BSs) is considered in coordination with horizontal collaboration (HC) and vertical collaboration (VC) of MEC servers to obtain lower end-to-end latency of VR delivery. A joint caching, power allocation and task offloading problem is then formulated, and a discrete branch-reduce-and-bound (DBRB) algorithm inspired by monotone optimization is proposed to effectively solve the problem. Simulation results demonstrate the advantage of the proposed architecture and algorithm over existing ones.

Submitted 4 September, 2021; originally announced September 2021.

Comments: 5 pages, 5 figures

arXiv:2108.08943 [pdf, other]

PatchMatch-RL: Deep MVS with Pixelwise Depth, Normal, and Visibility

Authors: Jae Yong Lee, Joseph DeGol, Chuhang Zou, Derek Hoiem

Abstract: Recent learning-based multi-view stereo (MVS) methods show excellent performance with dense cameras and small depth ranges. However, non-learning-based approaches still outperform them for scenes with large depth ranges and sparser wide-baseline views, in part due to their PatchMatch optimization over pixelwise estimates of depth, normals, and visibility. In this paper, we propose an end-to-end trainable PatchMatch-based MVS approach that combines advantages of trainable costs and regularizations with pixelwise estimates. To overcome the challenge of the non-differentiable PatchMatch optimization that involves iterative sampling and hard decisions, we use reinforcement learning to minimize expected photometric cost and maximize likelihood of ground truth depth and normals. We incorporate normal estimation by using dilated patch kernels, and propose a recurrent cost regularization that applies beyond frontal plane-sweep algorithms to our pixelwise depth/normal estimates. We evaluate our method on widely used MVS benchmarks, ETH3D and Tanks and Temples (TnT), and compare to other state-of-the-art learning-based MVS models. On ETH3D, our method outperforms other recent learning-based approaches and performs comparably on advanced TnT.

Submitted 19 August, 2021; originally announced August 2021.

Comments: Accepted to ICCV 2021 for oral presentation

arXiv:2106.03021 [pdf, other]

doi 10.1109/TIP.2021.3087397

SADRNet: Self-Aligned Dual Face Regression Networks for Robust 3D Dense Face Alignment and Reconstruction

Authors: Zeyu Ruan, Changqing Zou, Longhai Wu, Gangshan Wu, Limin Wang

Abstract: Three-dimensional face dense alignment and reconstruction in the wild is a challenging problem as partial facial information is commonly missing in occluded and large pose face images. Large head pose variations also increase the solution space and make the modeling more difficult. Our key idea is to model occlusion and pose to decompose this challenging task into several relatively more manageable subtasks. To this end, we propose an end-to-end framework, termed Self-aligned Dual face Regression Network (SADRNet), which predicts a pose-dependent face and a pose-independent face. They are combined by an occlusion-aware self-alignment to generate the final 3D face. Extensive experiments on two popular benchmarks, AFLW2000-3D and Florence, demonstrate that the proposed method achieves significantly superior performance over existing state-of-the-art methods.

Submitted 5 June, 2021; originally announced June 2021.

Comments: To appear in IEEE Transactions on Image Processing. Code and model is available at https://github.com/MCG-NJU/SADRNet

arXiv:2104.05666 [pdf, other]

View-Guided Point Cloud Completion

Authors: Xuancheng Zhang, Yutong Feng, Siqi Li, Changqing Zou, Hai Wan, Xibin Zhao, Yandong Guo, Yue Gao

Abstract: This paper presents a view-guided solution for the task of point cloud completion. Unlike most existing methods directly inferring the missing points using shape priors, we address this task by introducing ViPC (view-guided point cloud completion) that takes the missing crucial global structure information from an extra single-view image. By leveraging a framework that sequentially performs effective cross-modality and cross-level fusions, our method achieves significantly superior results over typical existing solutions on a new large-scale dataset we collect for the view-guided point cloud completion task.

Submitted 13 April, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

Comments: 10 pages, 8 figures, CVPR2021

arXiv:2104.04671 [pdf]

doi 10.1109/ISCC55528.2022.9912787

A Web Infrastructure for Certifying Multimedia News Content for Fake News Defense

Authors: Edward L. Amoruso, Raghu Avula, Stephen P. Johnson, Cliff C. Zou

Abstract: In dealing with altered multimedia news content, also referred to as fake news, we present a ready-to-deploy scheme based on existing public key infrastructure as a new fake news defense paradigm. This scheme enables news organizations to certify/endorse newsworthy multimedia news content and securely and conveniently pass this trust information to end users. A news organization can use our program to digitally sign the multimedia news content with its private key. By installing a browser extension, an end user can easily verify whether a piece of news content has been endorsed and by which organization. It is totally up to the end user whether to trust the news or the endorsing news organization. The underlying principles of our scheme are that fake news will sooner or later be identified as fake by the general population, and a news organization puts its long-term reputation on the line when endorsing news content.

Submitted 23 May, 2023; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: 7 pages, 6 figures

arXiv:2103.04503 [pdf, other]

End-to-End Human Object Interaction Detection with HOI Transformer

Authors: Cheng Zou, Bohan Wang, Yue Hu, Junqi Liu, Qian Wu, Yu Zhao, Boxun Li, Chenguang Zhang, Chi Zhang, Yichen Wei, Jian Sun

Abstract: We propose HOI Transformer to tackle human object interaction (HOI) detection in an end-to-end manner. Current approaches either decouple the HOI task into separated stages of object detection and interaction classification or introduce a surrogate interaction problem. In contrast, our method, named HOI Transformer, streamlines the HOI pipeline by eliminating the need for many hand-designed components. HOI Transformer reasons about the relations of objects and humans from global image context and directly predicts HOI instances in parallel. A quintuple matching loss is introduced to force HOI predictions in a unified way. Our method is conceptually much simpler and demonstrates improved accuracy. Without bells and whistles, HOI Transformer achieves $26.61\%$ $AP$ on HICO-DET and $52.9\%$ $AP_{role}$ on V-COCO, surpassing previous methods with the advantage of being much simpler. We hope our approach will serve as a simple and effective alternative for HOI tasks. Code is available at https://github.com/bbepoch/HoiTransformer .

Submitted 7 March, 2021; originally announced March 2021.

Comments: Accepted to CVPR2021

arXiv:2103.01006 [pdf]

doi 10.1038/s44172-023-00066-3

GaNDLF: A Generally Nuanced Deep Learning Framework for Scalable End-to-End Clinical Workflows in Medical Imaging

Authors: Sarthak Pati, Siddhesh P. Thakur, İbrahim Ethem Hamamcı, Ujjwal Baid, Bhakti Baheti, Megh Bhalerao, Orhun Güley, Sofia Mouchtaris, David Lang, Spyridon Thermos, Karol Gotkowski, Camila González, Caleb Grenko, Alexander Getka, Brandon Edwards, Micah Sheller, Junwen Wu, Deepthi Karkada, Ravi Panchumarthy, Vinayak Ahluwalia, Chunrui Zou, Vishnu Bashyam, Yuemeng Li, Babak Haghighi, Rhea Chitalia , et al. (17 additional authors not shown)

Abstract: Deep Learning (DL) has the potential to optimize machine learning in both the scientific and clinical communities. However, greater expertise is required to develop DL algorithms, and the variability of implementations hinders their reproducibility, translation, and deployment. Here we present the community-driven Generally Nuanced Deep Learning Framework (GaNDLF), with the goal of lowering these barriers. GaNDLF makes the mechanism of DL development, training, and inference more stable, reproducible, interpretable, and scalable, without requiring an extensive technical background. GaNDLF aims to provide an end-to-end solution for all DL-related tasks in computational precision medicine. We demonstrate the ability of GaNDLF to analyze both radiology and histology images, with built-in support for k-fold cross-validation, data augmentation, multiple modalities and output classes. Our quantitative performance evaluation on numerous use cases, anatomies, and computational tasks supports GaNDLF as a robust application framework for deployment in clinical workflows.

Submitted 16 May, 2023; v1 submitted 25 February, 2021; originally announced March 2021.

Comments: Deep Learning, Framework, Segmentation, Regression, Classification, Cross-validation, Data augmentation, Deployment, Clinical, Workflows

Journal ref: Commun Eng 2, 23 (2023)

arXiv:2007.05876 [pdf, other]

On Runtime Software Security of TrustZone-M based IoT Devices

Authors: Lan Luo, Yue Zhang, Cliff C. Zou, Xinhui Shao, Zhen Ling, Xinwen Fu

Abstract: Internet of Things (IoT) devices have been increasingly integrated into our daily life. However, such smart devices suffer from a broad attack surface. Particularly, attacks targeting the device software at runtime are challenging to defend against if IoT devices use resource-constrained microcontrollers (MCUs). TrustZone-M, a TrustZone extension for MCUs, is an emerging security technique fortifying MCU-based IoT devices. This paper presents the first security analysis of potential software security issues in TrustZone-M enabled MCUs. We explore the stack-based buffer overflow (BOF) attack for code injection, return-oriented programming (ROP) attack, heap-based BOF attack, format string attack, and attacks against Non-secure Callable (NSC) functions in the context of TrustZone-M. We validate these attacks using the TrustZone-M enabled SAM L11 MCU. Strategies to mitigate these software attacks are also discussed.

Submitted 11 July, 2020; originally announced July 2020.

Comments: 6 pages, 3 figures

arXiv:2003.13910 [pdf, other]

Attention-based Multi-modal Fusion Network for Semantic Scene Completion

Authors: Siqi Li, Changqing Zou, Yipeng Li, Xibin Zhao, Yue Gao

Abstract: This paper presents an end-to-end 3D convolutional network named attention-based multi-modal fusion network (AMFNet) for the semantic scene completion (SSC) task of inferring the occupancy and semantic labels of a volumetric 3D scene from single-view RGB-D images. Compared with previous methods which use only the semantic features extracted from RGB-D images, the proposed AMFNet learns to perform effective 3D scene completion and semantic segmentation simultaneously via leveraging the experience of inferring 2D semantic segmentation from RGB-D images as well as the reliable depth cues in spatial dimension. It is achieved by employing a multi-modal fusion architecture boosted from 2D semantic segmentation and a 3D semantic completion network empowered by residual attention blocks. We validate our method on both the synthetic SUNCG-RGBD dataset and the real NYUv2 dataset, and the results show gains of 2.5% and 2.6%, respectively, against the state-of-the-art method.

Submitted 15 April, 2020; v1 submitted 30 March, 2020; originally announced March 2020.

Comments: Accepted by AAAI 2020

arXiv:2003.02683 [pdf, other]

SketchyCOCO: Image Generation from Freehand Scene Sketches

Authors: Chengying Gao, Qi Liu, Qi Xu, Limin Wang, Jianzhuang Liu, Changqing Zou

Abstract: We introduce the first method for automatic image generation from scene-level freehand sketches. Our model allows for controllable image generation by specifying the synthesis goal via freehand sketches. The key contribution is an attribute vector bridged Generative Adversarial Network called EdgeGAN, which supports high visual-quality object-level image content generation without using freehand sketches as training data. We have built a large-scale composite dataset called SketchyCOCO to support and evaluate the solution. We validate our approach on the tasks of both object-level and scene-level image generation on SketchyCOCO. Through quantitative and qualitative results, human evaluation, and ablation studies, we demonstrate the method's capacity to generate realistic complex scene-level images from various freehand sketches.

Submitted 7 April, 2020; v1 submitted 5 March, 2020; originally announced March 2020.

arXiv:2002.07362 [pdf, other]

MILA: Multi-Task Learning from Videos via Efficient Inter-Frame Attention

Authors: Donghyun Kim, Tian Lan, Chuhang Zou, Ning Xu, Bryan A. Plummer, Stan Sclaroff, Jayan Eledath, Gerard Medioni

Abstract: Prior work in multi-task learning has mainly focused on predictions on a single image. In this work, we present a new approach for multi-task learning from videos via efficient inter-frame local attention (MILA). Our approach contains a novel inter-frame attention module which allows learning of task-specific attention across frames. We embed the attention module in a "slow-fast" architecture, where the slower network runs on sparsely sampled keyframes and the light-weight shallow network runs on non-keyframes at a high frame rate. We also propose an effective adversarial learning strategy to encourage the slow and fast network to learn similar features. Our approach ensures low-latency multi-task learning while maintaining high quality predictions. Experiments show competitive accuracy compared to state-of-the-art on two multi-task learning benchmarks while reducing the number of floating point operations (FLOPs) by up to 70%. In addition, our attention based feature propagation method (ILA) outperforms prior work in terms of task accuracy while also reducing up to 90% of FLOPs.

Submitted 10 October, 2021; v1 submitted 17 February, 2020; originally announced February 2020.

Comments: Accepted in ICCV 2021 MTL Workshop

arXiv:1910.09447 [pdf, other]

Improving Style Transfer with Calibrated Metrics

Authors: Mao-Chuang Yeh, Shuai Tang, Anand Bhattad, Chuhang Zou, David Forsyth

Abstract: Style transfer methods produce a transferred image which is a rendering of a content image in the manner of a style image. We seek to understand how to improve style transfer. To do so requires quantitative evaluation procedures, but the current evaluation is qualitative, mostly involving user studies. We describe a novel quantitative evaluation procedure. Our procedure relies on two statistics: the Effectiveness (E) statistic measures the extent to which a given style has been transferred to the target, and the Coherence (C) statistic measures the extent to which the original image's content is preserved. Our statistics are calibrated to human preference: targets with larger values of E (resp. C) will reliably be preferred by human subjects in comparisons of style (resp. content). We use these statistics to investigate the relative performance of a number of Neural Style Transfer (NST) methods, revealing several intriguing properties. Admissible methods lie on a Pareto frontier (i.e. improving E reduces C or vice versa). Three methods are admissible: Universal style transfer produces very good C but weak E; modifying the optimization used for Gatys' loss produces a method with strong E and strong C; and a modified cross-layer method has slightly better E at strong cost in C. While the histogram loss improves the E statistics of Gatys' method, it does not make the method admissible. Surprisingly, style weights have relatively little effect in improving EC scores, and most variability in the transfer is explained by the style itself (meaning experimenters can be misguided by selecting styles).

Submitted 13 February, 2020; v1 submitted 21 October, 2019; originally announced October 2019.

Comments: updated conference camera ready version. arXiv admin note: text overlap with arXiv:1804.00118

arXiv:1910.05786 [pdf, other]

Progress Notes Classification and Keyword Extraction using Attention-based Deep Learning Models with BERT

Authors: Matthew Tang, Priyanka Gandhi, Md Ahsanul Kabir, Christopher Zou, Jordyn Blakey, Xiao Luo

Abstract: Various deep learning algorithms have been developed to analyze different types of clinical data, including clinical text classification and information extraction from free text. However, automating keyword extraction from clinical notes is still challenging. The challenges include dealing with noisy clinical notes which contain various abbreviations, possible typos, and unstructured sentences. The objective of this research is to investigate attention-based deep learning models to classify the de-identified clinical progress notes extracted from a real-world EHR system. The attention-based deep learning models can be used to interpret the models and understand the critical words that drive the correct or incorrect classification of the clinical progress notes. The attention-based models in this research are capable of presenting human-interpretable text classification models. The results show that the fine-tuned BERT with the attention layer can achieve a high classification accuracy of 97.6%, which is higher than the baseline fine-tuned BERT classification model. In this research, we also demonstrate that the attention-based models can identify relevant keywords that are strongly related to the clinical progress note categories.

Submitted 24 October, 2019; v1 submitted 13 October, 2019; originally announced October 2019.

arXiv:1910.04099 [pdf, other]

Manhattan Room Layout Reconstruction from a Single 360 image: A Comparative Study of State-of-the-art Methods

Authors: Chuhang Zou, Jheng-Wei Su, Chi-Han Peng, Alex Colburn, Qi Shan, Peter Wonka, Hung-Kuo Chu, Derek Hoiem

Abstract: Recent approaches for predicting layouts from 360 panoramas produce excellent results. These approaches build on a common framework consisting of three steps: a pre-processing step based on edge-based alignment, prediction of layout elements, and a post-processing step by fitting a 3D layout to the layout elements. Until now, it has been difficult to compare the methods due to multiple different design decisions, such as the encoding network (e.g. SegNet or ResNet), type of elements predicted (e.g. corners, wall/floor boundaries, or semantic segmentation), or method of fitting the 3D layout. To address this challenge, we summarize and describe the common framework, the variants, and the impact of the design decisions. For a complete evaluation, we also propose extended annotations for the Matterport3D dataset [3], and introduce two depth-based evaluation metrics.

Submitted 25 December, 2020; v1 submitted 9 October, 2019; originally announced October 2019.

Comments: Accepted by International Journal of Computer Vision (IJCV), 2021

arXiv:1909.04326 [pdf, other]

Universal Physical Camouflage Attacks on Object Detectors

Authors: Lifeng Huang, Chengying Gao, Yuyin Zhou, Cihang Xie, Alan Yuille, Changqing Zou, Ning Liu

Abstract: In this paper, we study physical adversarial attacks on object detectors in the wild. Previous works mostly craft instance-dependent perturbations only for rigid or planar objects. To this end, we propose to learn an adversarial pattern to effectively attack all instances belonging to the same object category, referred to as Universal Physical Camouflage Attack (UPC). Concretely, UPC crafts camouflage by jointly fooling the region proposal network, as well as misleading the classifier and the regressor to output errors. In order to make UPC effective for non-rigid or non-planar objects, we introduce a set of transformations for mimicking deformable properties. We additionally impose an optimization constraint to make generated patterns look natural to human observers. To fairly evaluate the effectiveness of different physical-world attacks, we present the first standardized virtual database, AttackScenes, which simulates the real 3D world in a controllable and reproducible environment. Extensive experiments suggest the superiority of our proposed UPC compared with existing physical adversarial attackers not only in virtual environments (AttackScenes), but also in real-world physical environments. Code and dataset are available at https://mesunhlf.github.io/index_physical.html.

Submitted 21 April, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

Comments: CVPR 2020; codes, models, and demos are available at https://mesunhlf.github.io/index_physical.html

arXiv:1909.00915 [pdf, other]

Counterfactual Depth from a Single RGB Image

Authors: Theerasit Issaranon, Chuhang Zou, David Forsyth

Abstract: We describe a method that predicts, from a single RGB image, a depth map that describes the scene when a masked object is removed - we call this "counterfactual depth" that models hidden scene geometry together with the observations. Our method works for the same reason that scene completion works: the spatial structure of objects is simple. But we offer a much higher resolution representation of space than current scene completion methods, as we operate at pixel-level precision and do not rely on a voxel representation. Furthermore, we do not require RGBD inputs. Our method uses a standard encoder-decoder architecture, with a decoder modified to accept an object mask. We describe a small evaluation dataset that we have collected, which allows inference about what factors affect reconstruction most strongly. Using this dataset, we show that our depth predictions for masked objects are better than other baselines.

Submitted 2 September, 2019; originally announced September 2019.

Showing 1–50 of 74 results for author: Zou, C