-
WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models
Authors:
Zijian He,
Peixin Chen,
Guangrun Wang,
Guanbin Li,
Philip H. S. Torr,
Liang Lin
Abstract:
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions, limiting their effectiveness in video try-on applications. Moreover, video-based models require extensive, high-quality data and substantial computational resources. To tackle these issues, we reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach. This model, conditioned on specific garments and individuals, is trained on still images rather than videos. It leverages diffusion guidance from pre-trained models including a video masked autoencoder for segment smoothness improvement and a self-supervised model for feature alignment of adjacent frames in the latent space. This integration markedly boosts the model's ability to maintain temporal coherence, enabling more effective video try-on within an image-based framework. Our experiments on the VITON-HD and DressCode datasets, along with tests on the VVT and TikTok datasets, demonstrate WildVidFit's capability to generate fluid and coherent videos. The project page is at wildvidfit-project.github.io.
Submitted 15 July, 2024;
originally announced July 2024.
-
Room temperature operation of germanium-silicon single-photon avalanche diode
Authors:
Neil Na,
Yen-Cheng Lu,
Yu-Hsuan Liu,
Po-Wei Chen,
Ying-Chen Lai,
You-Ru Lin,
Chung-Chih Lin,
Tim Shia,
Chih-Hao Cheng,
Shu-Lu Chen
Abstract:
The ability to detect single photons has led to the advancement of numerous research fields. Although various types of single-photon detector have been developed, because of two main factors - that is, (1) the need for operating at cryogenic temperature and (2) the incompatibility with complementary metal-oxide-semiconductor (CMOS) fabrication processes - so far, to our knowledge, only Si-based single-photon avalanche diode (SPAD) has gained mainstream success and has been used in consumer electronics. With the growing demand to shift the operation wavelength from near-infrared to short-wavelength infrared (SWIR) for better safety and performance, an alternative solution is required because Si has negligible optical absorption for wavelengths beyond 1 μm. Here we report a CMOS-compatible, high-performing germanium-silicon SPAD operated at room temperature, featuring a noise-equivalent power improvement over the previous Ge-based SPADs by 2-3.5 orders of magnitude. Key parameters such as dark count rate, single-photon detection probability at 1,310 nm, timing jitter, after-pulsing characteristic time and after-pulsing probability are, respectively, measured as 19 kHz μm^2, 12%, 188 ps, ~90 ns and <1%, with a low breakdown voltage of 10.26 V and a small excess bias of 0.75 V. Three-dimensional point-cloud images are captured with direct time-of-flight technique as proof of concept. This work paves the way towards using single-photon-sensitive SWIR sensors, imagers and photonic integrated circuits in everyday life.
Submitted 14 July, 2024;
originally announced July 2024.
-
ENOVA: Autoscaling towards Cost-effective and Stable Serverless LLM Serving
Authors:
Tao Huang,
Pengfei Chen,
Kyoka Gong,
Jocky Hawk,
Zachary Bright,
Wenxin Xie,
Kecheng Huang,
Zhi Ji
Abstract:
With the increasing popularity of large language model (LLM) backend systems, it is common and necessary to deploy stable serverless LLM serving on multi-GPU clusters with autoscaling. However, challenges arise because the diversity and co-location of applications in multi-GPU clusters lead to low service quality and GPU utilization. To address them, we build ENOVA, a deployment, monitoring and autoscaling service for serverless LLM serving. ENOVA comprehensively deconstructs the execution process of an LLM service, based on which it designs a configuration recommendation module for automatic deployment on any GPU cluster and a performance detection module for autoscaling. On top of them, ENOVA implements a deployment execution engine for multi-GPU cluster scheduling. The experimental results show that ENOVA significantly outperforms other state-of-the-art methods and is suitable for wide deployment in large online systems.
Submitted 17 May, 2024;
originally announced July 2024.
-
How coronal mass ejections are influenced by the morphology and toroidal flux of their source magnetic flux ropes?
Authors:
J. H. Guo,
L. Linan,
S. Poedts,
Y. Guo,
B. Schmieder,
A. Lani,
Y. W. Ni,
M. Brchnelova,
B. Perri,
T. Baratashvili,
S. T. Li,
P. F. Chen
Abstract:
Coronal mass ejections (CMEs) stand as intense eruptions of magnetized plasma from the Sun, playing a pivotal role in driving significant changes of the heliospheric environment. Deducing the properties of CMEs from their progenitors in solar source regions is crucial for space weather forecasting. The primary objective of this paper is to establish a connection between CMEs and their progenitors in solar source regions, enabling us to infer the magnetic structures of CMEs before their full development. To this end, we create a dataset comprising a magnetic flux rope series with varying projection shapes, sizes and toroidal fluxes, using the Regularized Biot-Savart Laws (RBSL). Thereafter, we simulate the propagation of these flux ropes from the solar surface to a distance of 25$R_{\odot}$ with our global coronal MHD model, COCONUT. Our parametric survey reveals significant impacts of source flux ropes on the consequent CMEs. We find that the projection shape can influence the magnetic structures of CMEs at 20$R_{\odot}$, albeit with minimal impacts on the propagation speed. However, these impacts diminish as source flux ropes become fat. In terms of toroidal flux, our simulation results demonstrate a pronounced correlation with the propagation speed of CMEs, as well as with the success of the eruption. This work builds the bridge between the CMEs in the outer corona and their progenitors in solar source regions. Our parametric survey suggests that the projection shape, cross-section radius and toroidal flux of source flux ropes are crucial parameters in predicting magnetic structures and propagation speed of CMEs, providing valuable insights for space weather prediction.
Submitted 12 July, 2024;
originally announced July 2024.
-
Quantitative diffusion approximation for the Neutral $r$-Alleles Wright-Fisher Model with Mutations
Authors:
Peng Chen,
Jie Xiong,
Lihu Xu,
Jiayu Zheng
Abstract:
We apply a Lindeberg principle under the Markov process setting to approximate the Wright-Fisher model with neutral $r$-alleles using a diffusion process, deriving an error rate based on a function class distance involving fourth-order bounded differentiable functions. This error rate consists of a linear combination of the maximum mutation rate and the reciprocal of the population size. Our result improves the error bound in the seminal work [PNAS,1977], where only the special case $r=2$ was studied.
Submitted 12 July, 2024;
originally announced July 2024.
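
For readers unfamiliar with the discrete model that the entry above approximates by a diffusion, the following is a minimal sketch of one common formulation of the neutral $r$-allele Wright-Fisher model with mutation. It is an illustration only, not the authors' construction: the function names, the mutation-matrix convention (`mutation[i][j]` is the per-generation probability that allele `i` mutates to allele `j`) and the example parameters are all assumptions. The abstract's error rate depends on the maximum mutation rate and on $1/N$, so the diffusion regime corresponds to a large population and small entries in this matrix.

```python
import random


def wright_fisher_step(counts, mutation, rng):
    """One generation of a neutral r-allele Wright-Fisher model with mutation.

    counts[i]: number of individuals carrying allele i (the sum is N).
    mutation[i][j]: assumed probability that allele i mutates to allele j (i != j).
    """
    n = sum(counts)
    r = len(counts)
    freqs = [c / n for c in counts]
    # Probability that one offspring carries allele j: neutral sampling of a
    # parent, followed by an optional mutation of the inherited allele.
    probs = []
    for j in range(r):
        stay = 1.0 - sum(mutation[j][k] for k in range(r) if k != j)
        incoming = sum(freqs[i] * mutation[i][j] for i in range(r) if i != j)
        probs.append(freqs[j] * stay + incoming)
    # Multinomial resampling: N independent draws keep the population size fixed.
    offspring = rng.choices(range(r), weights=probs, k=n)
    return [offspring.count(j) for j in range(r)]


def simulate(counts, mutation, generations, seed=0):
    """Return the list of allele-count vectors over the given number of generations."""
    rng = random.Random(seed)
    path = [list(counts)]
    for _ in range(generations):
        counts = wright_fisher_step(counts, mutation, rng)
        path.append(counts)
    return path


if __name__ == "__main__":
    # Hypothetical parameters: r = 3 alleles, N = 100, symmetric mutation rate 0.01.
    u = [[0.0, 0.01, 0.01], [0.01, 0.0, 0.01], [0.01, 0.01, 0.0]]
    for generation, state in enumerate(simulate([50, 30, 20], u, 5)):
        print(generation, state)
```

Dividing each count vector by $N$ gives the allele-frequency path on the simplex, which is the object a diffusion approximation would track.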
-
Data-Locality-Aware Task Assignment and Scheduling for Distributed Job Executions
Authors:
Hailiang Zhao,
Xueyan Tang,
Peng Chen,
Jianwei Yin,
Shuiguang Deng
Abstract:
This paper investigates a data-locality-aware task assignment and scheduling problem aimed at minimizing job completion times for distributed job executions. Without prior knowledge of future job arrivals, we propose an optimal balanced task assignment algorithm (OBTA) that minimizes the completion time of each arriving job. We significantly reduce OBTA's computational overhead by narrowing the search space of potential solutions. Additionally, we extend an approximate algorithm known as water-filling (WF) and nontrivially prove that its approximation factor equals the number of task groups in the job assignment. We also design a novel heuristic, replica-deletion (RD), which outperforms WF. To further reduce the completion time of each job, we expand the problem to include job reordering, where we adjust the order of outstanding jobs following the shortest-estimated-time-first policy. Extensive trace-driven evaluations validate the performance and efficiency of the proposed algorithms.
Submitted 15 July, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.
-
Beyond Instruction Following: Evaluating Rule Following of Large Language Models
Authors:
Wangtao Sun,
Chenxiang Zhang,
Xueyou Zhang,
Ziyang Huang,
Haotian Xu,
Pei Chen,
Shizhu He,
Jun Zhao,
Kang Liu
Abstract:
Although Large Language Models (LLMs) have demonstrated strong instruction-following ability to be helpful, they are further supposed to be controlled and guided by rules in real-world scenarios to be safe and accurate in their responses. This demands that LLMs possess rule-following capability. However, few works have made a clear evaluation of the rule-following capability of LLMs. Previous studies that try to evaluate the rule-following capability of LLMs fail to distinguish the rule-following scenarios from the instruction-following scenarios. Therefore, this paper first clarifies the concept of rule-following and curates a comprehensive benchmark, RuleBench, to evaluate a diversified range of rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our further analysis provides insights into how LLMs can be improved toward better rule-following intelligent agents. The data and code can be found at: https://anonymous.4open.science/r/llm-rule-following-B3E3/
Submitted 11 July, 2024;
originally announced July 2024.
-
A Two-stage Evolutionary Framework For Multi-objective Optimization
Authors:
Peng Chen,
Jing Liang,
Kangjia Qiao,
Ponnuthurai Nagaratnam Suganthan,
Xuanxuan Ban
Abstract:
In the field of evolutionary multi-objective optimization, the approximation of the Pareto front (PF) is achieved by utilizing a collection of representative candidate solutions that exhibit desirable convergence and diversity. Although several multi-objective evolutionary algorithms (MOEAs) have been designed, they still have difficulties in keeping a balance between convergence and diversity of the population. To better solve multi-objective optimization problems (MOPs), this paper proposes a Two-stage Evolutionary Framework For Multi-objective Optimization (TEMOF). As the name suggests, algorithms are divided into two stages to enhance the search capability of the population. During the first half of the evolution, parental selection is exclusively conducted from the primary population. Additionally, we not only perform environmental selection on the current population, but also establish an external archive to store individuals situated on the first PF. Subsequently, in the second stage, parents are randomly chosen either from the population or the archive. In the experiments, one classic MOEA and two state-of-the-art MOEAs are integrated into the framework to form three new algorithms. The experimental results demonstrate the superior and robust performance of the proposed framework across a wide range of MOPs. In addition, the best of the three new algorithms is compared with several existing MOEAs and shows better results. Finally, we summarize the reasons why the two-stage framework is effective for the existing benchmark functions.
Submitted 9 July, 2024;
originally announced July 2024.
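
The two-stage parent selection described in the entry above can be sketched as follows. This is a rough illustration under stated assumptions, not the TEMOF implementation: the function name `select_parents`, the uniform draw within a pool, the 50/50 choice between population and archive in the second stage, and the fallback to the population when the archive is empty are all hypothetical choices that the abstract does not specify.

```python
import random


def select_parents(population, archive, generation, max_generations, k, rng):
    """Pick k parents with a two-stage rule.

    Stage 1 (first half of the run): parents come only from the population.
    Stage 2 (second half): each parent comes from the population or from the
    archive, chosen with equal probability (the equal split is an assumption).
    """
    first_stage = generation < max_generations // 2
    parents = []
    for _ in range(k):
        if first_stage or not archive:
            pool = population
        else:
            pool = population if rng.random() < 0.5 else archive
        parents.append(rng.choice(pool))
    return parents


if __name__ == "__main__":
    rng = random.Random(0)
    population = ["p1", "p2", "p3", "p4"]
    archive = ["a1", "a2"]  # stands in for stored first-front individuals
    print(select_parents(population, archive, 10, 100, 4, rng))  # stage 1
    print(select_parents(population, archive, 80, 100, 4, rng))  # stage 2
```

In a full MOEA loop, the archive would be refreshed after each environmental selection with the individuals on the first front; that step is omitted here.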
-
WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
Authors:
Pingyi Chen,
Chenglu Zhu,
Sunyi Zheng,
Honglin Li,
Lin Yang
Abstract:
Whole slide imaging is routinely adopted for carcinoma diagnosis and prognosis. Abundant experience is required for pathologists to achieve accurate and reliable diagnostic results of whole slide images (WSI). The huge size and heterogeneous features of WSIs make the workflow of pathological reading extremely time-consuming. In this paper, we propose a novel framework (WSI-VQA) to interpret WSIs by generative visual question answering. WSI-VQA shows universality by reframing various kinds of slide-level tasks in a question-answering pattern, in which pathologists can achieve immunohistochemical grading, survival prediction, and tumor subtyping following human-machine interaction. Furthermore, we establish a WSI-VQA dataset which contains 8672 slide-level question-answering pairs with 977 WSIs. Besides the ability to deal with different slide-level tasks, our generative model which is named Wsi2Text Transformer (W2T) outperforms existing discriminative models in medical correctness, which reveals the potential of our model to be applied in the clinical scenario. Additionally, we also visualize the co-attention mapping between word embeddings and WSIs as an intuitive explanation for diagnostic results. The dataset and related code are available at https://github.com/cpystan/WSI-VQA.
Submitted 8 July, 2024;
originally announced July 2024.
-
HuntFUZZ: Enhancing Error Handling Testing through Clustering Based Fuzzing
Authors:
Jin Wei,
Ping Chen,
Jun Dai,
Xiaoyan Sun,
Zhihao Zhang,
Chang Xu,
Yi Wanga
Abstract:
Testing a program's capability to effectively handle errors is a significant challenge, given that program errors are relatively uncommon. To solve this, Software Fault Injection (SFI)-based fuzzing integrates SFI and traditional fuzzing, injecting and triggering errors for testing (error handling) code. However, we observe that current SFI-based fuzzing approaches have overlooked the correlation between paths housing error points. In fact, the execution paths of error points often share common paths. Nonetheless, fuzzers usually generate test cases repeatedly to test error points on commonly traversed paths. This practice can compromise the efficiency of the fuzzer(s). Thus, this paper introduces HuntFUZZ, a novel SFI-based fuzzing framework that addresses the issue of redundant testing of error points with correlated paths. Specifically, HuntFUZZ clusters these correlated error points and utilizes concolic execution to compute constraints only for common paths within each cluster. By doing so, we provide the fuzzer with efficient test cases to explore related error points with minimal redundancy. We evaluate HuntFUZZ on a diverse set of 42 applications, and HuntFUZZ successfully reveals 162 known bugs, with 62 of them being related to error handling. Additionally, due to its efficient error point detection method, HuntFUZZ discovers 7 unique zero-day bugs, which are all missed by existing fuzzers. Furthermore, we compare HuntFUZZ with 4 existing fuzzing approaches, including AFL, AFL++, AFLGo, and EH-FUZZ. Our evaluation confirms that HuntFUZZ can cover a broader range of error points, and it exhibits better performance in terms of bug finding speed.
Submitted 5 July, 2024;
originally announced July 2024.
-
SQLaser: Detecting DBMS Logic Bugs with Clause-Guided Fuzzing
Authors:
Jin Wei,
Ping Chen,
Kangjie Lu,
Jun Dai,
Xiaoyan Sun
Abstract:
Database Management Systems (DBMSs) are vital components in modern data-driven systems. Their complexity often leads to logic bugs, which are implementation errors within the DBMSs that can lead to incorrect query results, data exposure, unauthorized access, etc., without necessarily causing visible system failures. Existing detection employs two strategies: rule-based bug detection and coverage-guided fuzzing. In general, rule specification itself is challenging; as a result, rule-based detection is limited to specific and simple rules. Coverage-guided fuzzing blindly explores code paths or blocks, many of which are unlikely to contain logic bugs; therefore, this strategy is cost-ineffective. In this paper, we design SQLaser, a SQL-clause-guided fuzzer for detecting logic bugs in DBMSs. Through a comprehensive examination of most existing logic bugs across four distinct DBMSs, excluding those causing system crashes, we have identified 35 logic bug patterns. These patterns manifest as certain SQL clause combinations that commonly result in logic bugs, and behind each of these clause combinations is a sequence of functions. We therefore model logic bug patterns as error-prone function chains (i.e., sequences of functions). We further develop a directed fuzzer with a new path-to-path distance-calculation mechanism for effectively testing these chains and discovering additional logic bugs. This mechanism enables SQLaser to swiftly navigate to target sites and uncover potential bugs emerging from these paths. Our evaluation, conducted on SQLite, MySQL, PostgreSQL, and TiDB, demonstrates that SQLaser significantly accelerates bug discovery compared to other fuzzing approaches, reducing detection time by approximately 60%.
Submitted 5 July, 2024;
originally announced July 2024.
-
Reduced-Order Neural Operators: Learning Lagrangian Dynamics on Highly Sparse Graphs
Authors:
Hrishikesh Viswanath,
Yue Chang,
Julius Berner,
Peter Yichen Chen,
Aniket Bera
Abstract:
We present a neural operator architecture to simulate Lagrangian dynamics, such as fluid flow, granular flows, and elastoplasticity. Traditional numerical methods, such as the finite element method (FEM), suffer from long run times and large memory consumption. On the other hand, approaches based on graph neural networks are faster but still suffer from long computation times on dense graphs, which are often required for high-fidelity simulations. Our model, GIOROM or Graph Interaction Operator for Reduced-Order Modeling, learns temporal dynamics within a reduced-order setting, capturing spatial features from a highly sparse graph representation of the input and generalizing to arbitrary spatial locations during inference. The model is geometry-aware and discretization-agnostic and can generalize to different initial conditions, velocities, and geometries after training. We show that point clouds of the order of 100,000 points can be inferred from sparse graphs with $\sim$1000 points, with negligible change in computation time. We empirically evaluate our model on elastic solids, Newtonian fluids, Non-Newtonian fluids, Drucker-Prager granular flows, and von Mises elastoplasticity. On these benchmarks, our approach results in a 25$\times$ speedup compared to other neural network-based physics simulators while delivering high-fidelity predictions of complex physical systems and showing better performance on most benchmarks. The code and the demos are provided at https://github.com/HrishikeshVish/GIOROM.
Submitted 4 July, 2024;
originally announced July 2024.
-
An elementary approach based on variational inequalities for modelling a friction-based locomotion problem
Authors:
Panyu Chen,
Alvaro Mateos Gonzalez,
Laurent Mertz
Abstract:
We propose an elementary proof based on a penalization technique to show the existence and uniqueness of the solution to a system of variational inequalities modelling the friction-based motion of a two-body crawling system. Here for each body, the static and dynamic friction coefficients are equal.
Submitted 4 July, 2024;
originally announced July 2024.
-
A Survey of Data Synthesis Approaches
Authors:
Hsin-Yu Chang,
Pei-Yu Chen,
Tun-Hsiang Chou,
Chang-Sheng Kao,
Hsuan-Yun Yu,
Yen-Ting Lin,
Yun-Nung Chen
Abstract:
This paper provides a detailed survey of synthetic data techniques. We first discuss the expected goals of using synthetic data in data augmentation, which can be divided into four parts: 1) Improving Diversity, 2) Data Balancing, 3) Addressing Domain Shift, and 4) Resolving Edge Cases. Data synthesis is closely related to the machine learning techniques prevailing at the time; therefore, we summarize the domain of synthetic data techniques into four categories: 1) Expert-knowledge, 2) Direct Training, 3) Pre-train then Fine-tune, and 4) Foundation Models without Fine-tuning. Next, we categorize the goals of synthetic data filtering into four types for discussion: 1) Basic Quality, 2) Label Consistency, and 3) Data Distribution. In section 5 of this paper, we also discuss the future directions of synthetic data and state three directions that we believe are important: 1) focus more on quality, 2) the evaluation of synthetic data, and 3) multi-model data augmentation.
Submitted 4 July, 2024;
originally announced July 2024.
-
Direct evidence of hybrid nature of EUV waves and the reflection of the fast-mode wave
Authors:
Ramesh Chandra,
P. F. Chen,
Pooja Devi
Abstract:
We performed an analysis of the extreme-ultraviolet (EUV) wave event on 2022 March 31. The event originated from active region (AR) 12975 located at N13W52 in the field of view of the Atmospheric Imaging Assembly (AIA) and exactly at the west limb viewed by the EUV Imager (EUVI) of the Solar Terrestrial Relations Observatory-Ahead (STEREO-A) satellite. The EUV wave was associated with an M9.6 class flare. The event was also well observed by MLSO and COR1 coronagraphs. We revealed here evident coexistence of two components of EUV waves in AIA as well as in EUVI images, i.e., a fast-mode wave and a nonwave, which was predicted by the EUV wave hybrid model. The speeds of the fast-mode and nonwave EUV wave components in AIA vary from ~430 to 658 km/s and ~157 to 205 km/s, respectively. The computed speeds in STEREO-A for the fast-mode wave and nonwave components are ~520 and ~152 km/s, respectively. Another wave emanated from the source AR and interacted with ambient coronal loops, showing evident reflection in the EUV images above the solar limb. The speed of the reflected wave in the plane of the sky is ~175 km/s. With the precise alignments, we found that the fast-mode EUV wave is just ahead of the coronal mass ejection (CME) and the nonwave component is cospatial with the frontal loop of the accompanying CME. The event also showed stationary fronts.
Submitted 6 July, 2024; v1 submitted 3 July, 2024;
originally announced July 2024.
-
MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
Authors:
Baijiong Lin,
Weisen Jiang,
Pengguang Chen,
Yu Zhang,
Shu Liu,
Ying-Cong Chen
Abstract:
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at https://github.com/EnVision-Research/MTMamba.
Submitted 14 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
Unveiling Mass Transfer in Solar Flares: Insights from Elemental Abundance Evolutions Observed by Chang'E-2 Solar X-ray Monitor
Authors:
Man-Hei Ng,
Chi-Long Tang,
Xiaoping Zhang,
Kuan-Vai Tam,
Peng-Fei Chen,
Wudong Dong,
Jing Li,
Chi-Pui Tang
Abstract:
Understanding how elemental abundances evolve during solar flares helps shed light on the mass and energy transfer between different solar atmospheric layers. However, prior studies have mostly concentrated on averaged abundances or specific flare phases, leaving a gap in exploring the comprehensive observations throughout the entire flare process. Consequently, investigations into this area are relatively scarce. Exploiting the Solar X-ray Monitor data obtained from the Chang'E-2 lunar orbiter, we present two comprehensive soft X-ray spectroscopic observations of flares in active regions, AR 11149 and 11158, demonstrating elemental abundance evolutions under different conditions. Our findings unveil the inverse first ionization potential (IFIP) effect during flares for Fe for the first time, and reaffirm its existence for Si. Additionally, we observed a rare depletion of elemental abundances, marking the second IFIP effect in flare decay phases. Our study offers a CSHKP model-based interpretation to elucidate the formation of both the FIP and IFIP effects in flare dynamics, with the inertia effect being incorporated into the ponderomotive force fractionation model.
Submitted 2 July, 2024;
originally announced July 2024.
-
Representing Arbitrary Ground States of Toric Code by Restricted Boltzmann Machine
Authors:
Penghua Chen,
Bowen Yan,
Shawn X. Cui
Abstract:
We systematically analyze the representability of toric code ground states by Restricted Boltzmann Machine with only local connections between hidden and visible neurons. This analysis is pivotal for evaluating the model's capability to represent diverse ground states, thus enhancing our understanding of its strengths and weaknesses. Subsequently, we modify the Restricted Boltzmann Machine to accommodate arbitrary ground states by introducing essential non-local connections efficiently. The new model is not only analytically solvable but also demonstrates efficient and accurate performance when solved using machine learning techniques. Then we generalize our model from $Z_2$ to $Z_n$ toric code and discuss future directions.
Submitted 15 July, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
Active-RIS-Aided Covert Communications in NOMA-Inspired ISAC Wireless Systems
Authors:
Miaomiao Zhu,
Pengxu Chen,
Liang Yang,
Alexandros-Apostolos A. Boulogeorgos,
Theodoros A. Tsiftsis,
Hongwu Liu
Abstract:
Non-orthogonal multiple access (NOMA)-inspired integrated sensing and communication (ISAC) facilitates spectrum sharing for radar sensing and NOMA communications, while facing privacy and security challenges due to open wireless propagation. In this paper, an active reconfigurable intelligent surface (RIS) is employed to aid covert communications in a NOMA-inspired ISAC wireless system with the aim of maximizing the covert rate. Specifically, a dual-function base-station (BS) transmits the superposition signal to sense multiple targets, while achieving covert and reliable communications for a pair of NOMA covert and public users, respectively, in the presence of a warden. Two superposition transmission schemes, namely, the transmissions with dedicated sensing signal (w-DSS) and without dedicated sensing signal (w/o-DSS), are respectively considered in the formulations of the joint transmission and reflection beamforming optimization problems. Numerical results demonstrate that the active-RIS-aided NOMA-ISAC system outperforms the passive-RIS-aided and without-RIS counterparts in terms of covert rate and trade-off between covert communication and sensing performance metrics. Finally, the w/o-DSS scheme, which omits the dedicated sensing signal, achieves a higher covert rate than the w-DSS scheme by allocating more transmit power for the covert transmissions, while preserving a comparable multi-target sensing performance.
Submitted 29 June, 2024;
originally announced July 2024.
-
Dark Superabsorbers with Dirac-delta-like superdirective radiation
Authors:
Jeng Yi Lee,
Irving Rondon,
Andrey E. Miroshnichenko,
Pai-Yen Chen
Abstract:
We theoretically and numerically reveal that under a given level of extinction cross section and with definite angular momentum channels dominant, there exists a physical limitation for absorption cross section being maximum and scattering cross section being minimum. In addition, any scattering system operated at this condition would be accompanied by a needle Dirac-delta-like far-field radiation pattern, which reduces perturbation of the background field except in the forward direction. We therefore refer to this outcome as dark superabsorbers. Moreover, by considering the mathematical Gibbs phenomenon, we find that a completely equivalent Dirac-delta far-field radiation is excluded even if we could properly design the scatterers operated at such conditions. We believe this finding has potential applications in the design of dark energy harvesting, lower-visibility receivers, superdirective light-matter interaction, and Fresnel diffractive imaging.
Submitted 28 June, 2024;
originally announced July 2024.
-
A Survey on Failure Analysis and Fault Injection in AI Systems
Authors:
Guangba Yu,
Gou Tan,
Haojia Huang,
Zhenyu Zhang,
Pengfei Chen,
Roberto Natella,
Zibin Zheng
Abstract:
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, a comprehensive review of FA and FI methodologies in AI systems is lacking. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions: (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, and (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state-of-the-art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems.
Submitted 27 June, 2024;
originally announced July 2024.
-
Vision Transformer with Key-select Routing Attention for Single Image Dehazing
Authors:
Lihan Tong,
Weijia Li,
Qingxia Yang,
Liyuan Chen,
Peng Chen
Abstract:
We present Ksformer, utilizing Multi-scale Key-select Routing Attention (MKRA) for intelligent selection of key areas through multi-channel, multi-scale windows with a top-k operator, and Lightweight Frequency Processing Module (LFPM) to enhance high-frequency features, outperforming other dehazing methods in tests.
Submitted 28 June, 2024;
originally announced June 2024.
-
Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness
Authors:
Erh-Chung Chen,
Pin-Yu Chen,
I-Hsin Chung,
Che-Rung Lee
Abstract:
The security and robustness of deep neural networks (DNNs) have become increasingly concerning. This paper aims to provide both a theoretical foundation and a practical solution to ensure the reliability of DNNs. We explore the concept of Lipschitz continuity to certify the robustness of DNNs against adversarial attacks, which aim to mislead the network by adding imperceptible perturbations to inputs. We propose a novel algorithm that remaps the input domain into a constrained range, reducing the Lipschitz constant and potentially enhancing robustness. Unlike existing adversarially trained models, where robustness is enhanced by introducing additional examples from other datasets or generative models, our method is almost cost-free as it can be integrated with existing models without requiring re-training. Experimental results demonstrate the generalizability of our method, as it can be combined with various models and achieve enhancements in robustness. Furthermore, our method achieves the best robust accuracy for CIFAR10, CIFAR100, and ImageNet datasets on the RobustBench leaderboard.
Submitted 27 June, 2024;
originally announced June 2024.
-
PUREPath: A Deep Latent Variational Model for Estimating CMB Posterior over Large Angular Scales of the Sky
Authors:
Vipin Sudevan,
Pisin Chen
Abstract:
We present a comprehensive neural architecture, the PUREPath, which leverages a nested Probabilistic multi-modal U-Net framework, augmented by the inclusion of probabilistic ResNet blocks in the Expanding Pathway of the decoders, to estimate the posterior density of the Cosmic Microwave Background (CMB) signal conditioned on the observed CMB data and the training dataset. By seamlessly integrating Bayesian statistics and variational methods, our model effectively minimizes foreground contamination in the observed CMB maps. The model is trained using foreground and noise contaminated CMB temperature maps simulated at Planck LFI and HFI frequency channels 30 - 353 GHz using publicly available Code for Anisotropies in the Microwave Background (CAMB) and Python Sky Model (PySM) packages. During training, our model transforms the initial prior distribution on the model parameters to posterior distributions based on the training data. From the joint full posterior of the model parameters, during inference, a predictive CMB posterior and summary statistics such as the predictive mean, variance, etc. of the cleaned CMB map are estimated. The predictive standard deviation map provides a direct and interpretable measure of uncertainty per pixel in the predicted mean CMB map. The cleaned CMB map along with the error estimates can be used for more accurate measurements of cosmological parameters and other cosmological analyses.
Submitted 28 June, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study
Authors:
Peikun Chen,
Sining Sun,
Changhao Shan,
Qing Yang,
Lei Xie
Abstract:
Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR tasks, where the entire speech utterance is needed during decoding. Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling abilities. While achieving streaming speech recognition, experiments on the AISHELL-1 and -2 datasets demonstrate the competitive performance of our streaming approach with non-streaming decoder-only counterparts.
Submitted 26 June, 2024;
originally announced June 2024.
-
Human-free Prompted Based Anomaly Detection: prompt optimization with Meta-guiding prompt scheme
Authors:
Pi-Wei Chen,
Jerry Chun-Wei Lin,
Jia Ji,
Feng-Hao Yeh,
Chao-Chun Chen
Abstract:
Pre-trained vision-language models (VLMs) are highly adaptable to various downstream tasks through few-shot learning, making prompt-based anomaly detection a promising approach. Traditional methods depend on human-crafted prompts that require prior knowledge of specific anomaly types. Our goal is to develop a human-free prompt-based anomaly detection framework that optimally learns prompts through data-driven methods, eliminating the need for human intervention. The primary challenge in this approach is the lack of anomalous samples during the training phase. Additionally, the Vision Transformer (ViT)-based image encoder in VLMs is not ideal for pixel-wise anomaly segmentation due to a locality feature mismatch between the original image and the output feature map. To tackle the first challenge, we have developed the Object-Attention Anomaly Generation Module (OAGM) to synthesize anomaly samples for training. Furthermore, our Meta-Guiding Prompt-Tuning Scheme (MPTS) iteratively adjusts the gradient-based optimization direction of learnable prompts to avoid overfitting to the synthesized anomalies. For the second challenge, we propose Locality-Aware Attention, which ensures that each local patch feature attends only to nearby patch features, preserving the locality features corresponding to their original locations. This framework allows for the optimal prompt embeddings by searching in the continuous latent space via backpropagation, free from human semantic constraints. Additionally, the modified locality-aware attention improves the precision of pixel-wise anomaly segmentation.
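The locality constraint described in this abstract (each patch attends only to nearby patches) can be pictured as an attention mask over the patch grid. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `locality_mask`, the square (Chebyshev-distance) window, and the `radius` parameter are all hypothetical choices made for illustration.

```python
def locality_mask(height, width, radius=1):
    """Build an N x N boolean mask for a height x width patch grid (N = height * width).

    mask[i][j] is True when patch j lies within `radius` rows and columns of
    patch i, i.e. when patch i would be allowed to attend to patch j.
    The square window is an assumption; the abstract does not specify its shape.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [max(abs(r1 - r2), abs(c1 - c2)) <= radius for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


# A 3 x 3 grid of patches, flattened row by row (index 4 is the centre patch).
mask = locality_mask(3, 3, radius=1)
centre_neighbours = sum(mask[4])  # centre patch sees the whole 3 x 3 grid
corner_neighbours = sum(mask[0])  # corner patch sees only its 2 x 2 corner block
```

In an attention layer, positions where the mask is False would typically have their scores set to a large negative value before the softmax, so that only nearby patches contribute.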
Submitted 26 June, 2024;
originally announced June 2024.
-
Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis
Authors:
Hongkang Li,
Meng Wang,
Shuai Zhang,
Sijia Liu,
Pin-Yu Chen
Abstract:
Efficient training and inference algorithms, such as low-rank adaption and model pruning, have shown impressive performance for learning Transformer-based large foundation models. However, due to the technical challenges of the non-convex optimization caused by the complicated architecture of Transformers, the theoretical study of why these methods can be applied to learn Transformers is mostly elusive. To the best of our knowledge, this paper shows the first theoretical analysis of the property of low-rank and sparsity of one-layer Transformers by characterizing the trained model after convergence using stochastic gradient descent. By focusing on a data model based on label-relevant and label-irrelevant patterns, we quantify that the gradient updates of trainable parameters are low-rank, which depends on the number of label-relevant patterns. We also analyze how model pruning affects the generalization while improving computation efficiency and conclude that proper magnitude-based pruning has a slight effect on the testing performance. We implement numerical experiments to support our findings.
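For readers unfamiliar with magnitude-based pruning, the idea is to zero out the weights with the smallest absolute values. The sketch below is a generic illustration under stated assumptions, not the procedure analyzed in the paper: the function name `magnitude_prune`, the list-of-lists weight layout, and the global (whole-matrix) threshold are hypothetical choices.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to 0.0.

    `weights` is a list of rows (lists of floats); `sparsity` is the fraction
    of entries to remove, between 0 and 1. A single global threshold is used
    (an assumption), so ties at the threshold may prune slightly more entries.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # number of entries to prune
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]  # largest magnitude that is still pruned
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Prune half of a 2 x 2 weight matrix: the two smallest magnitudes are zeroed.
pruned = magnitude_prune([[0.1, -2.0], [0.05, 3.0]], sparsity=0.5)
```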
Submitted 24 June, 2024;
originally announced June 2024.
-
Feature Purified Transformer With Cross-level Feature Guiding Decoder For Multi-class OOD and Anomaly Detection
Authors:
Jerry Chun-Wei Lin,
Pi-Wei Chen,
Chao-Chun Chen
Abstract:
Reconstruction networks are prevalently used in unsupervised anomaly and Out-of-Distribution (OOD) detection due to their independence from labeled anomaly data. However, in multi-class datasets, the effectiveness of anomaly detection is often compromised by the models' generalized reconstruction capabilities, which allow anomalies to blend within the expanded boundaries of normality resulting from the added categories, thereby reducing detection accuracy. We introduce the FUTUREG framework, which incorporates two innovative modules: the Feature Purification Module (FPM) and the CFG Decoder. The FPM constrains the normality boundary within the latent space to effectively filter out anomalous features, while the CFG Decoder uses layer-wise encoder representations to guide the reconstruction of filtered features, preserving fine-grained details. Together, these modules enhance the reconstruction error for anomalies, ensuring high-quality reconstructions for normal samples. Our results demonstrate that FUTUREG achieves state-of-the-art performance in multi-class OOD settings and remains competitive in industrial anomaly detection scenarios.
Submitted 30 April, 2024;
originally announced June 2024.
-
Reflective Liquid-Crystal Phase Shifter based on Periodically Loaded Differential Microstrip Lines
Authors:
Yuh-Chyi Chang,
Tien-Lun Ting,
Pei-Ru Chen,
Tsung-Hsien Lin
Abstract:
High-performance phase control units are crucial in beamforming technology, which has gained substantial attention for its ability to manipulate the wireless propagation environment, thereby enhancing capacity and coverage in communication networks. This paper presents the design and fabrication of a 3.5 GHz reflective liquid-crystal (LC) phase shifter. The phase shifter is constructed using coplanar differential lines, periodically loaded with floating electrodes. The LCs in the overlapping areas act as variable capacitors, and the continuous phase shift can be adjusted by applying AC to the permittivity in these areas. Both simulation and measurement results demonstrate impressive Figures of Merit (FoM) of 101.3 degrees per dB and 85.7 degrees per dB, respectively. The grounding issues typically associated with coplanar waveguides (CPWs) on glass substrates are effectively mitigated by employing the virtual ground in a differential pair configuration. The innovative reflective-type operation minimizes the unit cell size and allows for low-cost manufacturing of phase shifter arrays and advances the practical development of beamforming technology.
Submitted 21 June, 2024;
originally announced June 2024.
-
MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models
Authors:
Zhongshen Zeng,
Yinhong Liu,
Yingjia Wan,
Jingyao Li,
Pengguang Chen,
Jianbo Dai,
Yuxuan Yao,
Rongwu Xu,
Zehan Qi,
Wanru Zhao,
Linling Shen,
Jianqiao Lu,
Haochen Tan,
Yukang Chen,
Hao Zhang,
Zhan Shi,
Bailin Wang,
Zhijiang Guo,
Jiaya Jia
Abstract:
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, it has been increasingly challenging to evaluate the reasoning capability of LLMs. Concretely, existing outcome-based benchmarks begin to saturate and become less sufficient to monitor the progress. To this end, we present a process-based benchmark MR-BEN that demands a meta-reasoning skill, where LLMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts, covering various subjects such as physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, open-source models are seemingly comparable to GPT-4 on outcome-based benchmarks, but they lag far behind on our benchmark, revealing the underlying reasoning capability gap between them. Our dataset and code are available at https://randolph-zeng.github.io/Mr-Ben.github.io/.
Submitted 19 June, 2024;
originally announced June 2024.
-
Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?
Authors:
Pinzhen Chen,
Simon Yu,
Zhicheng Guo,
Barry Haddow
Abstract:
Large language models, particularly multilingual ones, are designed, claimed, and expected to cater to native speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may not perfectly align with this objective owing to a heavy reliance on translation, which can introduce translation artefacts and defects. It remains unknown whether the nature of the instruction data has an impact on the model output; conversely, it is questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues using controlled native or translated data during instruction tuning and evaluation stages. Experiments on eight base models and eight different benchmarks show that native or generation benchmarks reveal a notable difference between native and translated instruction data especially when model performance is high, whereas other types of test sets cannot. The comparison between round-trip and single-pass translations reflects the importance of knowledge from language-native resources. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.
Submitted 11 July, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Authors:
Wenbin An,
Feng Tian,
Sicong Leng,
Jiahao Nie,
Haonan Lin,
QianYing Wang,
Guang Dai,
Ping Chen,
Shijian Lu
Abstract:
Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) are facing a prevalent problem with object hallucinations, where the generated textual responses are inconsistent with ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features, while failing to capture prompt-relevant local features, consequently undermining the visual grounding capacity of LVLMs and leading to hallucinations. To this end, we propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates object hallucinations by exploring an ensemble of global features for response generation and local features for visual discrimination simultaneously. Our approach exhibits an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is reserved while irrelevant distractions are masked. With the augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances general perception capability for LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.
Submitted 21 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
UIFV: Data Reconstruction Attack in Vertical Federated Learning
Authors:
Jirui Yang,
Peng Chen,
Zhihui Lu,
Qiang Duan,
Yubing Bao
Abstract:
Vertical Federated Learning (VFL) facilitates collaborative machine learning without the need for participants to share raw private data. However, recent studies have revealed privacy risks where adversaries might reconstruct sensitive features through data leakage during the learning process. Although data reconstruction methods based on gradient or model information are somewhat effective, they reveal limitations in VFL application scenarios. This is because these traditional methods heavily rely on specific model structures and/or have strict limitations on application scenarios. To address this, our study introduces the Unified InverNet Framework into VFL, which yields a novel and flexible approach (dubbed UIFV) that leverages intermediate feature data to reconstruct original data, instead of relying on gradients or model details. The intermediate feature data is the feature exchanged by different participants during the inference phase of VFL. Experiments on four datasets demonstrate that our methods significantly outperform state-of-the-art techniques in attack precision. Our work exposes severe privacy vulnerabilities within VFL systems that pose real threats to practical VFL applications and thus confirms the necessity of further enhancing privacy protection in the VFL architecture.
Submitted 18 June, 2024;
originally announced June 2024.
-
MDCR: A Dataset for Multi-Document Conditional Reasoning
Authors:
Peter Baile Chen,
Yi Zhang,
Chunwei Liu,
Sejal Gupta,
Yoon Kim,
Michael Cafarella
Abstract:
The same real-life questions posed to different individuals may lead to different answers based on their unique situations. For instance, whether a student is eligible for a scholarship depends on eligibility conditions, such as major or degree required. ConditionalQA was proposed to evaluate models' capability of reading a document and answering eligibility questions, considering unmentioned conditions. However, it is limited to questions on single documents, neglecting harder cases that may require cross-document reasoning and optimization, for example, "What is the maximum number of scholarships attainable?" Such questions over multiple documents are more challenging not only because there is more context to understand, but also because the model has to (1) explore all possible combinations of unmentioned conditions and (2) understand the relationship between conditions across documents, to reason about the optimal outcome. To evaluate models' capability of answering such questions, we propose a new dataset MDCR, which can reflect real-world challenges and serve as a new test bed for complex conditional reasoning that requires optimization. We evaluate this dataset using the most recent LLMs and demonstrate their limitations in solving this task. We believe this dataset will facilitate future research in answering optimization questions with unknown conditions.
Submitted 17 June, 2024;
originally announced June 2024.
-
Technique Report of CVPR 2024 PBDL Challenges
Authors:
Ying Fu,
Yu Li,
Shaodi You,
Boxin Shi,
Linwei Chen,
Yunhao Zou,
Zichun Wang,
Yichen Li,
Yuze Han,
Yingkai Zhang,
Jianan Wang,
Qinglin Liu,
Wei Yu,
Xiaoqian Lv,
Jianing Li,
Shengping Zhang,
Xiangyang Ji,
Yuanpei Chen,
Yuhan Zhang,
Weihang Peng,
Liwen Zhang,
Zhe Xu,
Dingyong Gou,
Cong Li,
Senyan Xu
, et al. (75 additional authors not shown)
Abstract:
The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements for various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held at the CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches.
Submitted 12 July, 2024; v1 submitted 15 June, 2024;
originally announced June 2024.
-
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
Authors:
Chenyu Zhou,
Mengdan Zhang,
Peixian Chen,
Chaoyou Fu,
Yunhang Shen,
Xiawu Zheng,
Xing Sun,
Rongrong Ji
Abstract:
The swift progress of Multi-modal Large Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet, most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating through a plethora of irrelevant and potentially misleading information in both text and image forms. To bridge this gap, we introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions and to follow intricate instructions to pinpoint the relevant image. In support of this task, we further craft a new dataset, VEGA, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA), to refine image-text correlation skills. Our evaluation of four leading closed-source models, as well as various open-source models using VEGA, underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT4V, achieved only modest success. By employing a multi-task, multi-scale post-training strategy, we have set a robust baseline for MLLMs on the IITC task, attaining an $85.8\%$ accuracy rate in image association and a $0.508$ Rouge score. These results validate the effectiveness of our dataset in improving MLLMs' capabilities for nuanced image-text comprehension.
Submitted 14 June, 2024;
originally announced June 2024.
-
The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models
Authors:
Yan Liu,
Yu Liu,
Xiaokang Chen,
Pin-Yu Chen,
Daoguang Zan,
Min-Yen Kan,
Tsung-Yi Ho
Abstract:
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases, which may cause negative social impacts or even bring catastrophic results in application. Previous works on this problem mainly focused on using black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly finetune or even pre-train language models on newly constructed anti-stereotypical datasets, which is costly. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of Social Bias Neurons. Specifically, we propose Integrated Gap Gradients (IG$^2$) to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior as a distributional property of language, we employ sentiment-bearing prompts to elicit classes of sensitive words (demographics) correlated with such sentiments. Our IG$^2$ thus attributes the uneven distribution for different demographics to specific Social Bias Neurons, which track the trail of unwanted behavior inside PLM units to achieve interpretability. Moreover, derived from our interpretable technique, Bias Neuron Suppression (BNS) is further proposed to mitigate social biases. By studying BERT, RoBERTa, and their attributable differences from debiased FairBERTa, IG$^2$ allows us to locate and suppress identified neurons, and further mitigate undesired behaviors. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability at low cost.
Submitted 14 June, 2024;
originally announced June 2024.
-
Optimizing Large Model Training through Overlapped Activation Recomputation
Authors:
Ping Chen,
Wenjie Zhang,
Shuibing He,
Yingjie Gu,
Zhuwei Peng,
Kexin Huang,
Xuan Zhan,
Weijian Chen,
Yi Zheng,
Zhefeng Wang,
Yanlong Yin,
Gang Chen
Abstract:
Large model training has been using recomputation to alleviate the memory pressure and pipelining to exploit the parallelism of data, tensor, and devices. The existing recomputation approaches may incur up to 40% overhead when training real-world models, e.g., the GPT model with 22B parameters. This is because they are executed on demand in the critical training path. In this paper, we design a new recomputation framework, Lynx, to reduce the overhead by overlapping the recomputation with communication occurring in training pipelines. It consists of an optimal scheduling algorithm (OPT) and a heuristic-based scheduling algorithm (HEU). OPT achieves a global optimum but suffers from a long search time. HEU was designed based on our observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all identical structures. HEU achieves a local optimum but reduces the search time by 99% compared to OPT. Our comprehensive evaluation using GPT models with 1.3B-20B parameters shows that both OPT and HEU outperform the state-of-the-art recomputation approaches (e.g., Megatron-LM and Checkmate) by 1.02-1.53x. HEU achieves performance similar to OPT with a search time of 0.16s on average.
Submitted 27 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models
Authors:
Yu Liu,
Lang Gao,
Mingxin Yang,
Yu Xie,
Ping Chen,
Xiaojin Zhang,
Wei Chen
Abstract:
Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the models' code comprehension and generation capabilities. However, sound, comprehensive research on detecting program vulnerabilities, a more specific task related to code, and evaluating the performance of LLMs in this more specialized scenario is still lacking. To address common challenges in vulnerability analysis, our study introduces a new benchmark, VulDetectBench, specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates LLMs' ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate the performance of 17 models (both open- and closed-source) and find that while existing models can achieve over 80% accuracy on tasks related to vulnerability identification and classification, they still fall short on specific, more detailed vulnerability analysis tasks, with less than 30% accuracy, making it difficult to provide valuable auxiliary information for professional vulnerability mining. Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security. VulDetectBench is publicly available at https://github.com/Sweetaroo/VulDetectBench.
Submitted 24 June, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
MOSA: Music Motion with Semantic Annotation Dataset for Cross-Modal Music Processing
Authors:
Yu-Fen Huang,
Nikki Moran,
Simon Coleman,
Jon Kelly,
Shun-Hwa Wei,
Po-Yin Chen,
Yun-Hsin Huang,
Tsung-Ping Chen,
Yu-Chia Kuo,
Yu-Chi Wei,
Chih-Hsuan Li,
Da-Yu Huang,
Hsuan-Kai Kao,
Ting-Wei Lin,
Li Su
Abstract:
In cross-modal music processing, translation between visual, auditory, and semantic content opens up new possibilities as well as challenges. The construction of such a transformative scheme depends upon a benchmark corpus with a comprehensive data infrastructure. In particular, the assembly of a large-scale cross-modal dataset presents major challenges. In this paper, we present the MOSA (Music mOtion with Semantic Annotation) dataset, which contains high-quality 3-D motion capture data, aligned audio recordings, and note-by-note semantic annotations of pitch, beat, phrase, dynamic, articulation, and harmony for 742 professional music performances by 23 professional musicians, comprising more than 30 hours and 570 K notes of data. To our knowledge, this is the largest cross-modal music dataset with note-level annotations to date. To demonstrate the usage of the MOSA dataset, we present several innovative cross-modal music information retrieval (MIR) and musical content generation tasks, including the detection of beats, downbeats, phrases, and expressive content from audio, video, and motion data, and the generation of musicians' body motion from given music audio. The dataset and code are available alongside this publication (https://github.com/yufenhuang/MOSA-Music-mOtion-and-Semantic-Annotation-dataset).
Submitted 10 June, 2024;
originally announced June 2024.
-
PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection
Authors:
Wei Li,
Pin-Yu Chen,
Sijia Liu,
Ren Wang
Abstract:
Deep neural networks are susceptible to backdoor attacks, where adversaries manipulate model predictions by inserting malicious samples into the training data. Currently, there is still a lack of direct filtering methods for identifying suspicious training data to unveil potential backdoor samples. In this paper, we propose a novel method, Prediction Shift Backdoor Detection (PSBD), leveraging an uncertainty-based approach requiring minimal unlabeled clean validation data. PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon, where poisoned models' predictions on clean data often shift away from true labels towards certain other labels with dropout applied during inference, while backdoor samples exhibit less PS. We hypothesize that PS results from a neuron bias effect, which makes neurons favor features of certain classes. PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference. Extensive experiments have been conducted to verify the effectiveness and efficiency of PSBD, which achieves state-of-the-art results among mainstream detection methods.
Submitted 9 June, 2024;
originally announced June 2024.
-
Multiplane Prior Guided Few-Shot Aerial Scene Rendering
Authors:
Zihan Gao,
Licheng Jiao,
Lingling Li,
Xu Liu,
Fang Liu,
Puhua Chen,
Yuwei Guo
Abstract:
Neural Radiance Fields (NeRF) have been successfully applied in various aerial scenes, yet they face challenges with sparse views due to limited supervision. The acquisition of dense aerial views is often prohibitive, as unmanned aerial vehicles (UAVs) may encounter constraints in perspective range and energy. In this work, we introduce Multiplane Prior guided NeRF (MPNeRF), a novel approach tailored for few-shot aerial scene rendering, marking a pioneering effort in this domain. Our key insight is that the intrinsic geometric regularities specific to aerial imagery could be leveraged to enhance NeRF in sparse aerial scenes. By investigating NeRF's and Multiplane Image (MPI)'s behavior, we propose to guide the training process of NeRF with a Multiplane Prior. The proposed Multiplane Prior draws upon MPI's benefits and incorporates advanced image comprehension through a SwinV2 Transformer, pre-trained via SimMIM. Our extensive experiments demonstrate that MPNeRF outperforms existing state-of-the-art methods applied in non-aerial contexts, by tripling the performance in SSIM and LPIPS even with three views available. We hope our work offers insights into the development of NeRF-based applications in aerial scenes with limited data.
Submitted 7 June, 2024;
originally announced June 2024.
-
A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques
Authors:
Megh Thakkar,
Quentin Fournier,
Matthew D Riemer,
Pin-Yu Chen,
Amal Zouaq,
Payel Das,
Sarath Chandar
Abstract:
Large language models are first pre-trained on trillions of tokens and then instruction-tuned or aligned to specific preferences. While pre-training remains out of reach for most researchers due to the compute required, fine-tuning has become affordable thanks to parameter-efficient methods such as LoRA and QLoRA. Alignment is known to be sensitive to the many factors involved, including the quantity and quality of data, the alignment method, and the adapter rank. However, there has not yet been an extensive study of their effect on downstream performance. To address this gap, we conduct an in-depth investigation of the impact of popular choices for three crucial axes: (i) the alignment dataset (HH-RLHF and BeaverTails), (ii) the alignment technique (SFT and DPO), and (iii) the model (LLaMA-1, Vicuna-v1.3, Mistral-7b, and Mistral-7b-Instruct). Our extensive setup spanning over 300 experiments reveals consistent trends and unexpected findings. We observe how more informative data helps with preference alignment, cases where supervised fine-tuning outperforms preference optimization, and how aligning to a distinct preference boosts performance on downstream tasks. Through our in-depth analyses, we put forward key guidelines to help researchers perform more effective parameter-efficient LLM alignment.
Submitted 7 June, 2024;
originally announced June 2024.
-
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
Authors:
Lin Lu,
Hai Yan,
Zenghui Yuan,
Jiawen Shi,
Wenqi Wei,
Pin-Yu Chen,
Pan Zhou
Abstract:
Jailbreak attacks in large language models (LLMs) entail inducing the models to generate content that breaches ethical and legal norms through the use of malicious prompts, posing a substantial threat to LLM security. Current strategies for jailbreak attack and defense often focus on optimizing locally within specific algorithmic frameworks, resulting in ineffective optimization and limited scalability. In this paper, we present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques, generalizing them to all possible attack surfaces. We employ directed acyclic graphs (DAGs) to position and analyze existing jailbreak attacks, defenses, and evaluation methodologies, and propose three comprehensive, automated, and logical frameworks. \texttt{AutoAttack} investigates dependencies in two lines of jailbreak optimization strategies: genetic algorithm (GA)-based attacks and adversarial-generation-based attacks. We then introduce an ensemble jailbreak attack to exploit these dependencies. \texttt{AutoDefense} offers a mixture-of-defenders approach by leveraging the dependency relationships in pre-generative and post-generative defense strategies. \texttt{AutoEvaluation} introduces a novel evaluation method that distinguishes hallucinations, which are often overlooked, from jailbreak attack and defense responses. Through extensive experiments, we demonstrate that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
Submitted 6 June, 2024;
originally announced June 2024.
-
RoboCoder: Robotic Learning from Basic Skills to General Tasks with Large Language Models
Authors:
Jingyao Li,
Pengguang Chen,
Sitong Wu,
Chuanyang Zheng,
Hong Xu,
Jiaya Jia
Abstract:
The emergence of Large Language Models (LLMs) has improved the prospects for robotic tasks. However, existing benchmarks are still limited to single tasks with limited generalization capabilities. In this work, we introduce a comprehensive benchmark and an autonomous learning framework, RoboCoder, aimed at enhancing the generalization capabilities of robots in complex environments. Unlike traditional methods that focus on single-task learning, our research emphasizes the development of a general-purpose robotic coding algorithm that enables robots to leverage basic skills to tackle increasingly complex tasks. The newly proposed benchmark consists of 80 manually designed tasks across 7 distinct entities, testing the models' ability to learn from minimal initial mastery. Initial testing revealed that even advanced models like GPT-4 could only achieve a 47% pass rate in three-shot scenarios with humanoid entities. To address these limitations, the RoboCoder framework integrates LLMs with a dynamic learning system that uses real-time environmental feedback to continuously update and refine action codes. This adaptive method achieved a 36% relative improvement. Our code will be released.
Submitted 6 June, 2024;
originally announced June 2024.
-
Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation
Authors:
Min-Jae Hwang,
Ilia Kulikov,
Benjamin Peloquin,
Hongyu Gong,
Peng-Jen Chen,
Ann Lee
Abstract:
In this paper, we propose a textless acoustic model with a self-supervised distillation strategy for noise-robust expressive speech-to-speech translation (S2ST). Recently proposed expressive S2ST systems have achieved impressive expressivity preservation performance by cascading a unit-to-speech (U2S) generator to the speech-to-unit translation model. However, these systems are vulnerable to the presence of noise in input speech, which is a common condition in real-world translation scenarios. To address this limitation, we propose a U2S generator that incorporates a distillation with no label (DINO) self-supervised training strategy into its pretraining process. Because the proposed method captures noise-agnostic expressivity representation, it can generate high-quality speech even in noisy environments. Objective and subjective evaluation results verified that the proposed method significantly improved the performance of the expressive S2ST system in noisy environments while maintaining competitive performance in clean environments.
Submitted 4 June, 2024;
originally announced June 2024.
-
CoNav: A Benchmark for Human-Centered Collaborative Navigation
Authors:
Changhao Li,
Xinyu Sun,
Peihao Chen,
Jugang Fan,
Zixu Wang,
Yanxia Liu,
Jinhui Zhu,
Chuang Gan,
Mingkui Tan
Abstract:
Human-robot collaboration, in which the robot intelligently assists the human with the upcoming task, is an appealing objective. To achieve this goal, the agent needs to be equipped with a fundamental collaborative navigation ability, where the agent should reason about human intention by observing human activities and then navigate to the human's intended destination in advance of the human. However, this vital ability has not been well studied in previous literature. To fill this gap, we propose a collaborative navigation (CoNav) benchmark. Our CoNav tackles the critical challenge of constructing a 3D navigation environment with realistic and diverse human activities. To achieve this, we design a novel LLM-based humanoid animation generation framework, which is conditioned on both text descriptions and environmental context. The generated humanoid trajectory obeys the environmental context and can be easily integrated into popular simulators. We empirically find that the existing navigation methods struggle in the CoNav task since they neglect the perception of human intention. To solve this problem, we propose an intention-aware agent for reasoning about both long-term and short-term human intention. The agent predicts navigation actions based on the predicted intention and panoramic observation. The emergent agent behavior, including observing humans, avoiding collisions with humans, and navigation, reveals the efficiency of the proposed datasets and agents.
Submitted 4 June, 2024;
originally announced June 2024.
-
What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding
Authors:
Hongkang Li,
Meng Wang,
Tengfei Ma,
Sijia Liu,
Zaixi Zhang,
Pin-Yu Chen
Abstract:
Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perceptron. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve a desirable generalization error by training with stochastic gradient descent (SGD). This paper provides the quantitative characterization of the sample complexity and number of iterations for convergence dependent on the fraction of discriminative nodes, the dominant patterns, and the initial model errors. Furthermore, we demonstrate that self-attention and positional encoding enhance generalization by making the attention map sparse and promoting the core neighborhood during training, which explains the superior feature representation of Graph Transformers. Our theoretical results are supported by empirical experiments on synthetic and real-world benchmarks.
Submitted 4 June, 2024;
originally announced June 2024.
-
Differentially Private Fine-Tuning of Diffusion Models
Authors:
Yu-Lin Tsai,
Yizhe Li,
Zekai Chen,
Po-Yu Chen,
Chia-Mu Yu,
Xuebin Ren,
Francois Buet-Golfouse
Abstract:
The integration of Differential Privacy (DP) with diffusion models (DMs) presents a promising yet challenging frontier, particularly due to the substantial memorization capabilities of DMs that pose significant privacy risks. Differential privacy offers a rigorous framework for safeguarding individual data points during model training, with Differential Privacy Stochastic Gradient Descent (DP-SGD) being a prominent implementation. The diffusion method decomposes image generation into iterative steps, theoretically aligning well with DP's incremental noise addition. Despite the natural fit, the unique architecture of DMs necessitates tailored approaches to effectively balance the privacy-utility trade-off. Recent developments in this field have highlighted the potential for generating high-quality synthetic data by pre-training on public data (e.g., ImageNet) and fine-tuning on private data. However, there is a pronounced gap in research on optimizing the trade-offs involved in DP settings, particularly concerning parameter efficiency and model scalability. Our work addresses this by proposing a parameter-efficient fine-tuning strategy optimized for private diffusion models, which minimizes the number of trainable parameters to enhance the privacy-utility trade-off. We empirically demonstrate that our method achieves state-of-the-art performance in DP synthesis, significantly surpassing previous benchmarks on widely studied datasets (e.g., with only 0.47M trainable parameters, achieving a more than 35% improvement over the previous state-of-the-art with a small privacy budget on the CelebA-64 dataset). Anonymized code is available at https://anonymous.4open.science/r/DP-LORA-F02F.
Submitted 3 June, 2024;
originally announced June 2024.
-
$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers
Authors:
Pengtao Chen,
Mingzhu Shen,
Peng Ye,
Jianjian Cao,
Chongjun Tu,
Christos-Savvas Bouganis,
Yiren Zhao,
Tao Chen
Abstract:
Diffusion models are widely recognized for generating high-quality and diverse images, but their poor real-time performance has led to numerous acceleration works, primarily focusing on UNet-based structures. Although diffusion transformers (DiT) have achieved more successful results, there is still a lack of exploration regarding the impact of DiT structure on generation, as well as no acceleration framework tailored to the DiT architecture. To tackle these challenges, we conduct an investigation into the correlation between DiT blocks and image generation. Our findings reveal that the front blocks of DiT are associated with the outline of the generated images, while the rear blocks are linked to the details. Based on this insight, we propose an overall training-free inference acceleration framework $Δ$-DiT: using a designed cache mechanism to accelerate the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages. Specifically, a DiT-specific cache mechanism called $Δ$-Cache is proposed, which considers the inputs of the previous sampling image and reduces the bias in the inference. Extensive experiments on PIXART-$α$ and DiT-XL demonstrate that $Δ$-DiT can achieve a $1.6\times$ speedup on 20-step generation and even improve performance in most cases. In the scenario of 4-step consistent model generation and the more challenging $1.12\times$ acceleration, our method significantly outperforms existing methods. Our code will be publicly available.
Submitted 3 June, 2024;
originally announced June 2024.