subscribe to arXiv mailings

Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Authors: Wei Shang, Dongwei Ren, Wanying Zhang, Yuming Fang, Wangmeng Zuo, Kede Ma

Abstract: Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we first describe a strong baseline for AVSR by putting together three variants of elementary building blocks: 1) a flow-guide… ▽ More Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we first describe a strong baseline for AVSR by putting together three variants of elementary building blocks: 1) a flow-guided recurrent unit that aggregates spatiotemporal information from previous frames, 2) a flow-refined cross-attention unit that selects spatiotemporal information from future frames, and 3) a hyper-upsampling unit that generates scaleaware and content-independent upsampling kernels. We then introduce ST-AVSR by equipping our baseline with a multi-scale structural and textural prior computed from the pre-trained VGG network. This prior has proven effective in discriminating structure and texture across different locations and scales, which is beneficial for AVSR. Comprehensive experiments show that ST-AVSR significantly improves super-resolution quality, generalization ability, and inference speed over the state-of-theart. The code is available at https://github.com/shangwei5/ST-AVSR. △ Less

Submitted 13 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV 2024, the code is available at https://github.com/shangwei5/ST-AVSR

ACM Class: I.4.3

arXiv:2407.09734 [pdf, ps, other]

Ab initio study of Z(N) = 6 magicity

Authors: H. Li, H. J. Ong, D. Fang, I. A. Mazur, I. J. Shin, A. M. Shirokov, J. P. Vary, P. Yin, X. Zhao, W. Zuo

Abstract: The existence of magic numbers of protons and neutrons in nuclei is essential for understanding nuclear structure and fundamental nuclear forces. Over decades, researchers have conducted theoretical and experimental studies on the new magic number Z(N) = 6, focusing on observables such as radii, binding energy, electromagnetic transition, and nucleon separation energies. We perform the ab initio n… ▽ More The existence of magic numbers of protons and neutrons in nuclei is essential for understanding nuclear structure and fundamental nuclear forces. Over decades, researchers have conducted theoretical and experimental studies on the new magic number Z(N) = 6, focusing on observables such as radii, binding energy, electromagnetic transition, and nucleon separation energies. We perform the ab initio no-core shell model calculations for the occupation numbers of the lowest single particle states in the ground states of Z(N) = 6 and Z(N) = 8 isotopes (isotones). Our calculations do not support Z(N) = 6 as a magic number over a span of atomic numbers. However, 14C and 14O exhibit the characteristics of double-magic nuclei. △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.07518 [pdf, other]

Multi-modal Crowd Counting via a Broker Modality

Authors: Haoliang Meng, Xiaopeng Hong, Chenhao Wang, Miao Shang, Wangmeng Zuo

Abstract: Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach by introducing an auxiliary broker modality and on this basis frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this brok… ▽ More Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. This task is challenging due to the significant gap between these distinct modalities. In this paper, we propose a novel approach by introducing an auxiliary broker modality and on this basis frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models. Additionally, we identify and address the ghosting effect caused by direct cross-modal image fusion in multi-modal crowd counting. Through extensive experimental evaluations on popular multi-modal crowd-counting datasets, we demonstrate the effectiveness of our method, which introduces only 4 million additional parameters, yet achieves promising results. The code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: This is the preprint version of the paper and supplemental material to appear in ECCV 2024. Please cite the final published version. Code is available at https://github.com/HenryCilence/Broker-Modality-Crowd-Counting

arXiv:2407.03422 [pdf, ps, other]

Rest-Frame Optical Spectroscopy of Ten z $\sim$ 2 Weak Emission-Line Quasars

Authors: Ying Chen, Bin Luo, W. N. Brandt, Wenwen Zuo, Cooper Dix, Trung Ha, Brandon Matthews, Jeremiah D. Paul, Richard M. Plotkin, Ohad Shemmer

Abstract: We present near-infrared spectroscopy of ten weak emission-line quasars (WLQs) at redshifts of $z\sim2$, obtained with the Palomar 200-inch Hale Telescope. WLQs are an exceptional population of type 1 quasars that exhibit weak or no broad emission lines in the ultraviolet (e.g., the C IV $λ1549$ line), and they display remarkable X-ray properties. We derive H$β$-based single-epoch virial black-hol… ▽ More We present near-infrared spectroscopy of ten weak emission-line quasars (WLQs) at redshifts of $z\sim2$, obtained with the Palomar 200-inch Hale Telescope. WLQs are an exceptional population of type 1 quasars that exhibit weak or no broad emission lines in the ultraviolet (e.g., the C IV $λ1549$ line), and they display remarkable X-ray properties. We derive H$β$-based single-epoch virial black-hole masses (median value $\rm 1.7 \times 10^{9} M_{\odot}$) and Eddington ratios (median value $0.5)$ for our sources. We confirm the previous finding that WLQ H$β$ lines, as a major low-ionization line, are not significantly weak compared to typical quasars. The most prominent feature of the WLQ optical spectra is the universally weak/absent [O III] $λ5007$ emission. They also display stronger optical Fe II emission than typical quasars. Our results favor the super-Eddington accretion scenario for WLQs, where the weak lines are a result of a soft ionizing continuum; the geometrically thick inner accretion disk and/or its associated outflow is responsible for obscuring the nuclear high-energy radiation and producing the soft ionizing continuum. We also report candidate extreme [O III] outflows (blueshifts of $\approx 500$ and $\rm 4900 km s^{-1}$) in one object. △ Less

Submitted 3 July, 2024; originally announced July 2024.

Comments: 19 pages, 8 figures, accepted for publication in ApJ

arXiv:2407.01155 [pdf, other]

CPT: Consistent Proxy Tuning for Black-box Optimization

Authors: Yuanyang He, Zitong Huang, Xinxing Xu, Rick Siow Mong Goh, Salman Khan, Wangmeng Zuo, Yong Liu, Chun-Mei Feng

Abstract: Black-box tuning has attracted recent attention due to that the structure or inner parameters of advanced proprietary models are not accessible. Proxy-tuning provides a test-time output adjustment for tuning black-box language models. It applies the difference of the output logits before and after tuning a smaller white-box "proxy" model to improve the black-box model. However, this technique serv… ▽ More Black-box tuning has attracted recent attention due to that the structure or inner parameters of advanced proprietary models are not accessible. Proxy-tuning provides a test-time output adjustment for tuning black-box language models. It applies the difference of the output logits before and after tuning a smaller white-box "proxy" model to improve the black-box model. However, this technique serves only as a decoding-time algorithm, leading to an inconsistency between training and testing which potentially limits overall performance. To address this problem, we introduce Consistent Proxy Tuning (CPT), a simple yet effective black-box tuning method. Different from Proxy-tuning, CPT additionally exploits the frozen large black-box model and another frozen small white-box model, ensuring consistency between training-stage optimization objective and test-time proxies. This consistency benefits Proxy-tuning and enhances model performance. Note that our method focuses solely on logit-level computation, which makes it model-agnostic and applicable to any task involving logit classification. Extensive experimental results demonstrate the superiority of our CPT in both black-box tuning of Large Language Models (LLMs) and Vision-Language Models (VLMs) across various datasets. The code is available at https://github.com/chunmeifeng/CPT. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: 10 pages,2 figures plus supplementary materials

arXiv:2407.01094 [pdf, other]

Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Authors: Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng Zuo, Qixiang Ye, Jingdong Wang

Abstract: Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to… ▽ More Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with the design of three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at https://github.com/MingXiangL/DEVIL. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00884 [pdf, other]

Mechanisms of mirror energy difference for states exhibiting Thomas-Ehrman shift: Gamow shell model case studies of $^{18}$Ne/$^{18}$O and $^{19}$Na/$^{19}$O

Authors: J. G. Li, K. H. Li, N. Michel, H. H. Li, W. Zuo

Abstract: The mirror energy difference (MED) of the mirror state, especially for states bearing the Thomas-Erhman shift, serves as a sensitive probe of isospin symmetry breaking. We employ the Gamow shell model, which includes the inter-nucleon correlation and continuum coupling, to investigate the MED for $sd$-shell nuclei by taking the $^{18}$Ne/$^{18}$O and $^{19}$Na/$^{19}$O as examples. Our GSM provide… ▽ More The mirror energy difference (MED) of the mirror state, especially for states bearing the Thomas-Erhman shift, serves as a sensitive probe of isospin symmetry breaking. We employ the Gamow shell model, which includes the inter-nucleon correlation and continuum coupling, to investigate the MED for $sd$-shell nuclei by taking the $^{18}$Ne/$^{18}$O and $^{19}$Na/$^{19}$O as examples. Our GSM provides good descriptions for the excitation energies and MEDs for the $^{18}$Ne/$^{18}$O and $^{19}$Na/$^{19}$O. Moreover, our calculations also reveal that the large MED of the mirror states is caused by the significant occupation of the weakly bound or unbound $s_{1/2}$ waves, giving the radial density distribution of the state in the proton-rich nucleus more extended than that of mirror states in deeply-bound neutron-rich nuclei. Furthermore, our GSM calculation shows that the contribution of Coulomb is different for the low-lying states in proton-rich nuclei, which significantly contributes to MEDs of mirror states. Moreover, the contributions of the nucleon-nucleon interaction are different for the mirror state, especially for the state of proton-rich nuclei bearing the Thomas-Erhman shift, which also contributes to the significant isospin symmetry breaking with large MED. △ Less

Submitted 30 June, 2024; originally announced July 2024.

Comments: 6 Pages,6 figures

arXiv:2406.14207 [pdf, other]

LayerMatch: Do Pseudo-labels Benefit All Layers?

Authors: Chaoqi Liang, Guanglei Yang, Lifeng Qiao, Zitong Huang, Hongliang Yan, Yunchao Wei, Wangmeng Zuo

Abstract: Deep neural networks have achieved remarkable performance across various tasks when supplied with large-scale labeled data. However, the collection of labeled data can be time-consuming and labor-intensive. Semi-supervised learning (SSL), particularly through pseudo-labeling algorithms that iteratively assign pseudo-labels for self-training, offers a promising solution to mitigate the dependency o… ▽ More Deep neural networks have achieved remarkable performance across various tasks when supplied with large-scale labeled data. However, the collection of labeled data can be time-consuming and labor-intensive. Semi-supervised learning (SSL), particularly through pseudo-labeling algorithms that iteratively assign pseudo-labels for self-training, offers a promising solution to mitigate the dependency of labeled data. Previous research generally applies a uniform pseudo-labeling strategy across all model layers, assuming that pseudo-labels exert uniform influence throughout. Contrasting this, our theoretical analysis and empirical experiment demonstrate feature extraction layer and linear classification layer have distinct learning behaviors in response to pseudo-labels. Based on these insights, we develop two layer-specific pseudo-label strategies, termed Grad-ReLU and Avg-Clustering. Grad-ReLU mitigates the impact of noisy pseudo-labels by removing the gradient detrimental effects of pseudo-labels in the linear classification layer. Avg-Clustering accelerates the convergence of feature extraction layer towards stable clustering centers by integrating consistent outputs. Our approach, LayerMatch, which integrates these two strategies, can avoid the severe interference of noisy pseudo-labels in the linear classification layer while accelerating the clustering capability of the feature extraction layer. Through extensive experimentation, our approach consistently demonstrates exceptional performance on standard semi-supervised learning benchmarks, achieving a significant improvement of 10.38% over baseline method and a 2.44% increase compared to state-of-the-art methods. △ Less

Submitted 27 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.11138 [pdf, other]

Diffusion Models in Low-Level Vision: A Survey

Authors: Chunming He, Yuqi Shen, Chengyu Fang, Fengyang Xiao, Longxiang Tang, Yulun Zhang, Wangmeng Zuo, Zhenhua Guo, Xiu Li

Abstract: Deep generative models have garnered significant attention in low-level vision tasks due to their generative capabilities. Among them, diffusion model-based solutions, characterized by a forward diffusion process and a reverse denoising process, have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity. This ensures the generation of visually compellin… ▽ More Deep generative models have garnered significant attention in low-level vision tasks due to their generative capabilities. Among them, diffusion model-based solutions, characterized by a forward diffusion process and a reverse denoising process, have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity. This ensures the generation of visually compelling results with intricate texture information. Despite their remarkable success, a noticeable gap exists in a comprehensive survey that amalgamates these pioneering diffusion model-based works and organizes the corresponding threads. This paper proposes the comprehensive review of diffusion model-based techniques. We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models, establishing the theoretical foundation. Following this, we introduce a multi-perspective categorization of diffusion models, considering both the underlying framework and the target task. Additionally, we summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios. Moreover, we provide an overview of commonly used benchmarks and evaluation metrics. We conduct a thorough evaluation, encompassing both performance and efficiency, of diffusion model-based techniques in three prominent tasks. Finally, we elucidate the limitations of current diffusion models and propose seven intriguing directions for future research. This comprehensive examination aims to facilitate a profound understanding of the landscape surrounding denoising diffusion models in the context of low-level vision tasks. A curated list of diffusion model-based techniques in over 20 low-level vision tasks can be found at https://github.com/ChunmingHe/awesome-diffusion-models-in-low-level-vision. △ Less

Submitted 16 June, 2024; originally announced June 2024.

Comments: 20 pages, 23 figures, 4 tables

arXiv:2406.08151 [pdf, other]

Unveiling potential neutron halos in intermediate-mass nuclei: an \textit{ab initio} study

Authors: H. H. Li, J. G. Li, M. R. Xie, W. Zuo

Abstract: Halos epitomize the fascinating interplay between weak binding, shell evolution, and deformation effects, especially in nuclei near the drip line. In this Letter, we apply the state-of-the-art \textit{ab initio} valence-space in-medium similarity renormalization group approach to predict potential candidates for one- and two-neutron halo in the intermediate-mass region. Notably, we use spectroscop… ▽ More Halos epitomize the fascinating interplay between weak binding, shell evolution, and deformation effects, especially in nuclei near the drip line. In this Letter, we apply the state-of-the-art \textit{ab initio} valence-space in-medium similarity renormalization group approach to predict potential candidates for one- and two-neutron halo in the intermediate-mass region. Notably, we use spectroscopic factors (SF) and two-nucleon amplitudes (TNA) as criteria for suggesting one- and two-neutron halo candidates, respectively. This approach is not only theoretically sound but also amenable to experimental validation. Our research focuses on Mg, Al, Si, P, and S neutron-drip-line nuclei, offering systematic predictions of neutron halo candidates in terms of separation energies, SF (TNA), and average occupation. The calculation suggests the ground states of $^{40,42,44,46}$Al, $^{41,43,45,47}$Si, $^{46,48}$P, and $^{47,49}$S are promising candidates for one-neutron halos, while $^{40,42,44,46}$Mg, $^{45,47}$Al, $^{46,48}$Si, $^{49}$P, and $^{50}$S may harbor two-neutron halos. In addition, the relative mean-square neutron radius between halo nuclei and \textit{inner core} is calculated for suggested potential neutron halos. Finally, the relations of halo formations and shell evolution are discussed. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07956 [pdf, other]

Ab initio calculations with a new local chiral N3LO nucleon-nucleon force

Authors: P. Y. Wang, J. G. Li, S. Zhang, Q. Yuan, M. R. Xie, W. Zuo

Abstract: Ab initio calculations have achieved remarkable success in nuclear structure studies. Numerous works highlight the pivotal role of three-body forces in nuclear ab initio calculations. Concurrently, efforts have been made to replicate these calculations using only realistic nucleon-nucleon (NN) interactions. A novel local chiral next-to-next-to-next-to-leading order (N3LO) NN interaction, distinct… ▽ More Ab initio calculations have achieved remarkable success in nuclear structure studies. Numerous works highlight the pivotal role of three-body forces in nuclear ab initio calculations. Concurrently, efforts have been made to replicate these calculations using only realistic nucleon-nucleon (NN) interactions. A novel local chiral next-to-next-to-next-to-leading order (N3LO) NN interaction, distinct due to its weaker tensor force, has recently been established. This paper applies this local NN interaction in ab initio frameworks to calculate the low-lying spectra of p-shell light nuclei, particularly 10B, ground-state energies and shell evolution in oxygen isotopes. Results are compared with calculations utilizing nonlocal chiral N3LO NN and chiral NN +3N interactions. The ab initio calculations with the local N N potential accurately describe the spectra of p-shell nuclei, notably the 10B. Additionally, the neutron drip line for oxygen isotopes, with 24O as the drip line nucleus, is accurately reproduced in ab initio calculations with the local NN interaction. Calculations with the local NN interaction also reproduce the subshell closure at N = 14 and 16, albeit with a stronger shell gap compared to experimental data. However, the calculated charge radii based on the local NN interaction are underestimated compared with experimental data, which is similar to results from the nonlocal NN interaction. Consequently, the present ab initio calculations further indicate significant spin-orbit splitting effects with the new local NN potential, suggesting that 3N forces remain an important consideration. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: 9 pages, 7 figures

arXiv:2406.07487 [pdf, other]

GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection

Authors: Hang Yao, Ming Liu, Haolin Wang, Zhicun Yin, Zifei Yan, Xiaopeng Hong, Wangmeng Zuo

Abstract: Diffusion models have shown superior performance on unsupervised anomaly detection tasks. Since trained with normal data only, diffusion models tend to reconstruct normal counterparts of test images with certain noises added. However, these methods treat all potential anomalies equally, which may cause two main problems. From the global perspective, the difficulty of reconstructing images with dif… ▽ More Diffusion models have shown superior performance on unsupervised anomaly detection tasks. Since trained with normal data only, diffusion models tend to reconstruct normal counterparts of test images with certain noises added. However, these methods treat all potential anomalies equally, which may cause two main problems. From the global perspective, the difficulty of reconstructing images with different anomalies is uneven. Therefore, instead of utilizing the same setting for all samples, we propose to predict a particular denoising step for each sample by evaluating the difference between image contents and the priors extracted from diffusion models. From the local perspective, reconstructing abnormal regions differs from normal areas even in the same image. Theoretically, the diffusion model predicts a noise for each step, typically following a standard Gaussian distribution. However, due to the difference between the anomaly and its potential normal counterpart, the predicted noise in abnormal regions will inevitably deviate from the standard Gaussian distribution. To this end, we propose introducing synthetic abnormal samples in training to encourage the diffusion models to break through the limitation of standard Gaussian distribution, and a spatial-adaptive feature fusion scheme is utilized during inference. With the above modifications, we propose a global and local adaptive diffusion model (abbreviated to GLAD) for unsupervised anomaly detection, which introduces appealing flexibility and achieves anomaly-free reconstruction while retaining as much normal information as possible. Extensive experiments are conducted on three commonly used anomaly detection datasets (MVTec-AD, MPDD, and VisA) and a printed circuit board dataset (PCB-Bank) we integrated, showing the effectiveness of the proposed method. △ Less

Submitted 2 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted by ECCV 2024, code and models: https://github.com/hyao1/GLAD. Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file

arXiv:2406.05669 [pdf, other]

Ab initio valence-space in-medium similarity renormalization group calculations for neutron-rich P, Cl, and K isotopes

Authors: M. R. Xie, L. Y. Shen, J. G. Li, H. H. Li, Q. Yuan, W. Zuo

Abstract: Neutron-rich P, Cl, and K isotopes, particularly those with neutron numbers around $N=28$, have attracted extensive experimental and theoretical interest. We utilize the \textit{ab initio} valence-space in-medium similarity renormalization group approach, based on chiral nucleon-nucleon and three-nucleon forces, to investigate the exotic properties of these isotopes. Systematic calculations of the… ▽ More Neutron-rich P, Cl, and K isotopes, particularly those with neutron numbers around $N=28$, have attracted extensive experimental and theoretical interest. We utilize the \textit{ab initio} valence-space in-medium similarity renormalization group approach, based on chiral nucleon-nucleon and three-nucleon forces, to investigate the exotic properties of these isotopes. Systematic calculations of the low-lying spectra are performed. A key finding is the level inversion between $3/2_1^+$ and $1/2_1^+$ states in odd-$A$ isotopes, attributed to the inversion of $π0d_{3/2}$ and $π1s_{1/2}$ single-particle states.\textit{Ab initio} calculations, which incorporate the three-nucleon forces, correlate closely with existing experimental data. Further calculations of effective proton single-particle energies provide deeper insights into the shell evolution for $Z=14$ and $16$ sub-shells. Our results indicate that the three-body force plays important roles in the shell evolution for $Z=14$ and $16$ sub-shells with neutron numbers ranging from 20 to 28. Additionally, systematic \textit{ab initio} calculations are conducted for the low-lying spectra of odd-odd nuclei. The results align with experimental data and provide new insights for future research into these isotopes, up to and beyond the drip line. △ Less

Submitted 9 June, 2024; originally announced June 2024.

Comments: 12 pages, 9 figures, 1 table

arXiv:2406.05640 [pdf, other]

doi 10.1007/s11433-023-2227-5

Spectroscopic factors of resonance states with the Gamow shell model

Authors: M. R. Xie, J. G. Li, N. Michel, H. H. Li, W. Zuo

Abstract: We provide an investigation of the spectroscopic factor of resonance states in $A =5-8$ nuclei, utilizing the Gamow shell model (GSM). Within the GSM, the configuration mixing is taken into account exactly with the shell model framework, and the continuum coupling is addressed via the complex-energy Berggren ensemble, which treats bound, resonance, and non-resonant continuum single-particle states… ▽ More We provide an investigation of the spectroscopic factor of resonance states in $A =5-8$ nuclei, utilizing the Gamow shell model (GSM). Within the GSM, the configuration mixing is taken into account exactly with the shell model framework, and the continuum coupling is addressed via the complex-energy Berggren ensemble, which treats bound, resonance, and non-resonant continuum single-particle states on an equal footing. As a result, both the configuration mixing and continuum coupling are meticulously considered in the GSM. We first calculate the low-lying states of helium isotopes and isotones with the GSM, and the results are compared with that of \textit{ab initio} no-core shell model (NCSM) calculations. The results indicate that GSM can reproduce the low-lying resonance states more accurately than the no-core shell model. Following this, we delve into the spectroscopic factors of the resonance states as computed through both GSM and NCSM, concurrently conducting systematic calculations of overlap functions pertinent to these resonance states. Finally, the calculated overlap function and spectroscopic factor of $^6$He$(0_1^+)$ $\otimes νp_{3/2} \to $ $^7$He$(3/2_1^-)$ with GSM are compared with the results from \textit{ab initio} NCSM, variational Monte Carlo, and Green's function Monte Carlo calculations, as well as avaliable experimental data. The results assert that wave function asymptotes can only be reproduced in GSM, where resonance and continuum coupling are precisely addressed. △ Less

Submitted 9 June, 2024; originally announced June 2024.

Comments: 7 pages, 3 figures, 1 table

arXiv:2406.01476 [pdf, other]

DreamPhysics: Learning Physical Properties of Dynamic 3D Gaussians with Video Diffusion Priors

Authors: Tianyu Huang, Yihan Zeng, Hui Li, Wangmeng Zuo, Rynson W. H. Lau

Abstract: Dynamic 3D interaction has witnessed great interest in recent works, while creating such 4D content remains challenging. One solution is to animate 3D scenes with physics-based simulation, and the other is to learn the deformation of static 3D objects with the distillation of video generative models. The former one requires assigning precise physical properties to the target object, otherwise the… ▽ More Dynamic 3D interaction has witnessed great interest in recent works, while creating such 4D content remains challenging. One solution is to animate 3D scenes with physics-based simulation, and the other is to learn the deformation of static 3D objects with the distillation of video generative models. The former one requires assigning precise physical properties to the target object, otherwise the simulated results would become unnatural. The latter tends to formulate the video with minor motions and discontinuous frames, due to the absence of physical constraints in deformation learning. We think that video generative models are trained with real-world captured data, capable of judging physical phenomenon in simulation environments. To this end, we propose DreamPhysics in this work, which estimates physical properties of 3D Gaussian Splatting with video diffusion priors. DreamPhysics supports both image- and text-conditioned guidance, optimizing physical parameters via score distillation sampling with frame interpolation and log gradient. Based on a material point method simulator with proper physical parameters, our method can generate 4D content with realistic motions. Experimental results demonstrate that, by distilling the prior knowledge of video diffusion models, inaccurate physical properties can be gradually refined for high-quality simulation. Codes are released at: https://github.com/tyhuang0428/DreamPhysics. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: Technical report. Codes are released at: https://github.com/tyhuang0428/DreamPhysics

arXiv:2405.20778 [pdf, other]

Improved Generation of Adversarial Examples Against Safety-aligned LLMs

Authors: Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen

Abstract: Despite numerous efforts to ensure large language models (LLMs) adhere to safety standards and produce harmless content, some successes have been achieved in bypassing these restrictions, known as jailbreak attacks against LLMs. Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing jailbreak attacks automatically. Nevertheless, due to the discrete… ▽ More Despite numerous efforts to ensure large language models (LLMs) adhere to safety standards and produce harmless content, some successes have been achieved in bypassing these restrictions, known as jailbreak attacks against LLMs. Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing jailbreak attacks automatically. Nevertheless, due to the discrete nature of texts, the input gradient of LLMs struggles to precisely reflect the magnitude of loss change that results from token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired in transfer-based attacks that were originally proposed for attacking black-box image classification models. For the first time, we appropriate the ideologies of effective methods among these transfer-based attacks, i.e., Skip Gradient Method and Intermediate Level Attack, for improving the effectiveness of automatically generated adversarial examples against white-box LLMs. With appropriate adaptations, we inject these ideologies into gradient-based adversarial prompt generation processes and achieve significant performance gains without introducing obvious computational cost. Meanwhile, by discussing mechanisms behind the gains, new insights are drawn, and proper combinations of these methods are also developed. Our empirical results show that the developed combination achieves >30% absolute increase in attack success rates compared with GCG for attacking the Llama-2-7B-Chat model on AdvBench. △ Less

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.19732 [pdf, other]

Two Optimizers Are Better Than One: LLM Catalyst Empowers Gradient-Based Optimization for Prompt Tuning

Authors: Zixian Guo, Ming Liu, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo

Abstract: Learning a skill generally relies on both practical experience by doer and insightful high-level guidance by instructor. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making locally optimal update at each step. Recent methods utilize large language models (LLMs) to optimize solutions for… ▽ More Learning a skill generally relies on both practical experience by doer and insightful high-level guidance by instructor. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making locally optimal update at each step. Recent methods utilize large language models (LLMs) to optimize solutions for concrete problems by inferring from natural language instructions, akin to a high-level instructor. In this paper, we show that these two optimizers are complementary to each other, suggesting a collaborative optimization approach. The gradient-based optimizer and LLM-based optimizer are combined in an interleaved manner. We instruct LLMs using task descriptions and timely optimization trajectories recorded during gradient-based optimization. Inferred results from LLMs are used as restarting points for the next stage of gradient optimization. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, our combined optimization method consistently yields improvements over competitive baseline prompt tuning methods. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code is released at https://github.com/guozix/LLM-catalyst. △ Less

Submitted 6 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.11750 [pdf, other]

The Intermediate-Mass Black Hole Reverberation Mapping Project: Initial Results for a candidate IMBH in a nearby Seyfert 1 Galaxy

Authors: Wenwen Zuo, Hengxiao Guo, Jingbo Sun, Qi Yuan, Paulina Lira, Minfeng Gu, Philip G. Edwards, Alok C. Gupta, Shubham Kishore, Jamie Stevens, Tao An, Zhen-Yi Cai, Haicheng Feng, Luis C. Ho, Dragana Ilić, Andjelka B. Kovačević, ShaSha Li, Mar Mezcua, Luka Č. Popović, Mouyuan Sun, Tushar Tripathi, Vivian U., Oliver Vince, Jianguo Wang, Junxian Wang , et al. (3 additional authors not shown)

Abstract: To investigate the short-term variability and determine the size of the optical continuum emitting size of intermediate-mass black holes (IMBHs), we carried out high-cadence, multi-band photometric monitoring of a Seyfert 1 galaxy J0249-0815 across two nights, together with a one-night single-band preliminary test. The presence of the broad Ha component in our target was confirmed by recent Paloma… ▽ More To investigate the short-term variability and determine the size of the optical continuum emitting size of intermediate-mass black holes (IMBHs), we carried out high-cadence, multi-band photometric monitoring of a Seyfert 1 galaxy J0249-0815 across two nights, together with a one-night single-band preliminary test. The presence of the broad Ha component in our target was confirmed by recent Palomar/P200 spectroscopic observations, 23 years after Sloan Digital Sky Survey, ruling out the supernovae origin of the broad Ha line. The photometric experiment was primarily conducted utilizing four-channel imagers MuSCAT 3 & 4 mounted on 2-meter telescopes within the Las Cumbres Observatory Global Telescope Network. Despite the expectation of variability, we observed no significant variation (<1.4%) on timescales of 6-10 hours. This non-detection is likely due to substantial host galaxy light diluting the subtle AGN variability. Dual-band preliminary tests and tailored simulations may enhance the possibility of detecting variability and lag in future IMBH reverberation campaigns. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: 14 pages, 6 figures, submitted to ApJ, comments welcome

arXiv:2405.09799 [pdf, ps, other]

Direct ab initio calculation of the $^{4}$He nuclear electric dipole polarizability

Authors: Peng Yin, Andrey M. Shirokov, Pieter Maris, Patrick J. Fasano, Mark A. Caprio, He Li, Wei Zuo, James P. Vary

Abstract: The calculation of nuclear electromagnetic sum rules by directly diagonalizing the nuclear Hamiltonian in a large basis is numerically challenging and has not been performed for $A>2$ nuclei. With the significant progress of high performance computing, we show that calculating sum rules using numerous discretized continuum states obtained by directly diagonalizing the ab initio no-core shell model… ▽ More The calculation of nuclear electromagnetic sum rules by directly diagonalizing the nuclear Hamiltonian in a large basis is numerically challenging and has not been performed for $A>2$ nuclei. With the significant progress of high performance computing, we show that calculating sum rules using numerous discretized continuum states obtained by directly diagonalizing the ab initio no-core shell model Hamiltonian is achievable numerically. Specifically, we calculate the $^{4}$He electric dipole ($E1$) polarizability, that is an inverse energy weighted sum rule, employing the Daejeon16 $NN$ interaction. We demonstrate that the calculations are numerically tractable as the dimension of the basis increases and are convergent. Our results for the $^{4}$He electric dipole polarizability are consistent with the most recent experimental data and are compared with those of other theoretical studies employing different techniques and various interactions. △ Less

Submitted 15 July, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.09555 [pdf, ps, other]

Analysis of Near-Field Effects, Spatial Non-Stationary Characteristics Based on 11-15 GHz Channel Measurement in Indoor Scenario

Authors: Haiyang Miao, Pan Tang, Weirang Zuo, Qi Wei, Lei Tian, Jianhua Zhang

Abstract: In the sixth-generation (6G), with the further expansion of array element number and frequency bands, the wireless communications are expected to operate in the near-field region. The near-field radio communications (NFRC) will become crucial in 6G communication systems. The new mid-band (6-24 GHz) is the 6G potential candidate spectrum. In this paper, we will investigate the channel measurements… ▽ More In the sixth-generation (6G), with the further expansion of array element number and frequency bands, the wireless communications are expected to operate in the near-field region. The near-field radio communications (NFRC) will become crucial in 6G communication systems. The new mid-band (6-24 GHz) is the 6G potential candidate spectrum. In this paper, we will investigate the channel measurements and characteristics for the emerging NFRC. First, the near-field spherical-wave signal model is derived in detail, and the stationary interval (SI) division method is discussed based on the channel statistical properties. Then, the influence of line-of-sight (LOS) and obstructed-LOS (OLOS) environments on the near-field effects and spatial non-stationary (SnS) characteristic are explored based on the near-field channel measurements at 11-15 GHz band. We hope that this work will give some reference to the NFRC research. △ Less

Submitted 19 April, 2024; originally announced May 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2404.17270

arXiv:2405.08589 [pdf, other]

Variable Substitution and Bilinear Programming for Aligning Partially Overlapping Point Sets

Authors: Wei Lian, Zhesen Cui, Fei Ma, Hang Pan, Wangmeng Zuo

Abstract: In many applications, the demand arises for algorithms capable of aligning partially overlapping point sets while remaining invariant to the corresponding transformations. This research presents a method designed to meet such requirements through minimization of the objective function of the robust point matching (RPM) algorithm. First, we show that the RPM objective is a cubic polynomial. Then, t… ▽ More In many applications, the demand arises for algorithms capable of aligning partially overlapping point sets while remaining invariant to the corresponding transformations. This research presents a method designed to meet such requirements through minimization of the objective function of the robust point matching (RPM) algorithm. First, we show that the RPM objective is a cubic polynomial. Then, through variable substitution, we transform the RPM objective to a quadratic function. Leveraging the convex envelope of bilinear monomials, we proceed to relax the resulting objective function, thus obtaining a lower bound problem that can be conveniently decomposed into distinct linear assignment and low-dimensional convex quadratic program components, both amenable to efficient optimization. Furthermore, a branch-and-bound (BnB) algorithm is devised, which solely branches over the transformation parameters, thereby boosting convergence rate. Empirical evaluations demonstrate better robustness of the proposed methodology against non-rigid deformation, positional noise, and outliers, particularly in scenarios where outliers remain distinct from inliers, when compared with prevailing state-of-the-art approaches. △ Less

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.05806 [pdf, other]

MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation

Authors: Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Lei Zhang, Wangmeng Zuo

Abstract: Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images with human identities indicated by the reference images. Despite promising identity fidelity has been achieved by several tuning-free methods, they usually suffer from overfitting issues. The learned identity tends to entangle with irrelevant information… ▽ More Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images with human identities indicated by the reference images. Despite promising identity fidelity has been achieved by several tuning-free methods, they usually suffer from overfitting issues. The learned identity tends to entangle with irrelevant information, resulting in unsatisfied text controllability, especially on faces. In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both faithful identity fidelity and flexible editability. Specifically, MasterWeaver adopts an encoder to extract identity features and steers the image generation through additional introduced cross attention. To improve editability while maintaining identity fidelity, we propose an editing direction loss for training, which aligns the editing directions of our MasterWeaver with those of the original T2I model. Additionally, a face-augmented dataset is constructed to facilitate disentangled identity learning, and further improve the editability. Extensive experiments demonstrate that our MasterWeaver can not only generate personalized images with faithful identity, but also exhibit superiority in text controllability. Our code will be publicly available at https://github.com/csyxwei/MasterWeaver. △ Less

Submitted 10 May, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

Comments: 34 pages

arXiv:2405.02171 [pdf, other]

Self-Supervised Learning for Real-World Super-Resolution from Dual and Multiple Zoomed Observations

Authors: Zhilu Zhang, Ruohao Wang, Hongzhi Zhang, Wangmeng Zuo

Abstract: In this paper, we consider two challenging issues in reference-based super-resolution (RefSR) for smartphone, (i) how to choose a proper reference image, and (ii) how to learn RefSR in a self-supervised manner. Particularly, we propose a novel self-supervised learning approach for real-world RefSR from observations at dual and multiple camera zooms. Firstly, considering the popularity of multiple… ▽ More In this paper, we consider two challenging issues in reference-based super-resolution (RefSR) for smartphone, (i) how to choose a proper reference image, and (ii) how to learn RefSR in a self-supervised manner. Particularly, we propose a novel self-supervised learning approach for real-world RefSR from observations at dual and multiple camera zooms. Firstly, considering the popularity of multiple cameras in modern smartphones, the more zoomed (telephoto) image can be naturally leveraged as the reference to guide the super-resolution (SR) of the lesser zoomed (ultra-wide) image, which gives us a chance to learn a deep network that performs SR from the dual zoomed observations (DZSR). Secondly, for self-supervised learning of DZSR, we take the telephoto image instead of an additional high-resolution image as the supervision information, and select a center patch from it as the reference to super-resolve the corresponding ultra-wide image patch. To mitigate the effect of the misalignment between ultra-wide low-resolution (LR) patch and telephoto ground-truth (GT) image during training, we first adopt patch-based optical flow alignment and then design an auxiliary-LR to guide the deforming of the warped LR features. To generate visually pleasing results, we present local overlapped sliced Wasserstein loss to better represent the perceptual difference between GT and output in the feature space. During testing, DZSR can be directly deployed to super-solve the whole ultra-wide image with the reference of the telephoto image. In addition, we further take multiple zoomed observations to explore self-supervised RefSR, and present a progressive fusion scheme for the effective utilization of reference images. Experiments show that our methods achieve better quantitative and qualitative performance against state-of-the-arts. Codes are available at https://github.com/cszhilu1998/SelfDZSR_PlusPlus. △ Less

Submitted 3 May, 2024; originally announced May 2024.

Comments: Accpted by IEEE TPAMI in 2024. Extended version of ECCV 2022 paper "Self-Supervised Learning for Real-World Super-Resolution from Dual Zoomed Observations" (arXiv:2203.01325)

arXiv:2404.17364 [pdf, other]

MV-VTON: Multi-View Virtual Try-On with Diffusion Models

Authors: Haoyu Wang, Zhilu Zhang, Donglin Di, Shiliang Zhang, Wangmeng Zuo

Abstract: The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, most existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge,… ▽ More The goal of image-based virtual try-on is to generate an image of the target person naturally wearing the given clothing. However, most existing methods solely focus on the frontal try-on using the frontal clothing. When the views of the clothing and person are significantly inconsistent, particularly when the person's view is non-frontal, the results are unsatisfactory. To address this challenge, we introduce Multi-View Virtual Try-ON (MV-VTON), which aims to reconstruct the dressing results of a person from multiple views using the given clothes. On the one hand, given that single-view clothes provide insufficient information for MV-VTON, we instead employ two images, i.e., the frontal and back views of the clothing, to encompass the complete view as much as possible. On the other hand, the diffusion models that have demonstrated superior abilities are adopted to perform our MV-VTON. In particular, we propose a view-adaptive selection method where hard-selection and soft-selection are applied to the global and local clothing feature extraction, respectively. This ensures that the clothing features are roughly fit to the person's view. Subsequently, we suggest a joint attention block to align and fuse clothing features with person features. Additionally, we collect a MV-VTON dataset, i.e., Multi-View Garment (MVG), in which each person has multiple photos with diverse views and poses. Experiments show that the proposed method not only achieves state-of-the-art results on MV-VTON task using our MVG dataset, but also has superiority on frontal-view virtual try-on task using VITON-HD and DressCode datasets. Codes and datasets will be publicly released at https://github.com/hywang2002/MV-VTON . △ Less

Submitted 29 April, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

Comments: 15 pages

arXiv:2404.17270 [pdf, other]

Empirical Studies of Propagation Characteristics and Modeling Based on XL-MIMO Channel Measurement: From Far-Field to Near-Field

Authors: Haiyang Miao, Jianhua Zhang, Pan Tang, Lei Tian, Weirang Zuo, Qi Wei, Guangyi Liu

Abstract: In the sixth-generation (6G), the extremely large-scale multiple-input-multiple-output (XL-MIMO) is considered a promising enabling technology. With the further expansion of array element number and frequency bands, near-field effects will be more likely to occur in 6G communication systems. The near-field radio communications (NFRC) will become crucial in 6G communication systems. It is known tha… ▽ More In the sixth-generation (6G), the extremely large-scale multiple-input-multiple-output (XL-MIMO) is considered a promising enabling technology. With the further expansion of array element number and frequency bands, near-field effects will be more likely to occur in 6G communication systems. The near-field radio communications (NFRC) will become crucial in 6G communication systems. It is known that the channel research is very important for the development and performance evaluation of the communication systems. In this paper, we will systematically investigate the channel measurements and modeling for the emerging NFRC. First, the principle design of massive MIMO channel measurement platform are solved. Second, an indoor XL-MIMO channel measurement campaign with 1600 array elements is conducted, and the channel characteristics are extracted and validated in the near-field region. Then, the outdoor XL-MIMO channel measurement campaign with 320 array elements is conducted, and the channel characteristics are extracted and modeled from near-field to far-field (NF-FF) region. The spatial non-stationary characteristics of angular spread at the transmitting end are more important in modeling. We hope that this work will give some reference to the near-field and far-field research for 6G. △ Less

Submitted 26 April, 2024; originally announced April 2024.

arXiv:2404.16331 [pdf, other]

IMWA: Iterative Model Weight Averaging Benefits Class-Imbalanced Learning Tasks

Authors: Zitong Huang, Ze Chen, Bowen Dong, Chaoqi Liang, Erjin Zhou, Wangmeng Zuo

Abstract: Model Weight Averaging (MWA) is a technique that seeks to enhance model's performance by averaging the weights of multiple trained models. This paper first empirically finds that 1) the vanilla MWA can benefit the class-imbalanced learning, and 2) performing model averaging in the early epochs of training yields a greater performance improvement than doing that in later epochs. Inspired by these t… ▽ More Model Weight Averaging (MWA) is a technique that seeks to enhance model's performance by averaging the weights of multiple trained models. This paper first empirically finds that 1) the vanilla MWA can benefit the class-imbalanced learning, and 2) performing model averaging in the early epochs of training yields a greater performance improvement than doing that in later epochs. Inspired by these two observations, in this paper we propose a novel MWA technique for class-imbalanced learning tasks named Iterative Model Weight Averaging (IMWA). Specifically, IMWA divides the entire training stage into multiple episodes. Within each episode, multiple models are concurrently trained from the same initialized model weight, and subsequently averaged into a singular model. Then, the weight of this average model serves as a fresh initialization for the ensuing episode, thus establishing an iterative learning paradigm. Compared to vanilla MWA, IMWA achieves higher performance improvements with the same computational cost. Moreover, IMWA can further enhance the performance of those methods employing EMA strategy, demonstrating that IMWA and EMA can complement each other. Extensive experiments on various class-imbalanced learning tasks, i.e., class-imbalanced image classification, semi-supervised class-imbalanced image classification and semi-supervised object detection tasks showcase the effectiveness of our IMWA. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.11313 [pdf, other]

NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Authors: Xin Li, Kun Yuan, Yajing Pei, Yiting Lu, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, Wei Sun, Haoning Wu, Zicheng Zhang, Jun Jia, Zhichao Zhang, Linhan Cao, Qiubo Chen, Xiongkuo Min, Weisi Lin, Guangtao Zhai, Jianhui Sun, Tianyi Wang, Lei Li, Han Kong, Wenxuan Wang, Bing Li, Cheng Luo , et al. (43 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The… ▽ More This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR2024 Workshop. The challenge report for CVPR NTIRE2024 Short-form UGC Video Quality Assessment Challenge

arXiv:2404.08514 [pdf, other]

NIR-Assisted Image Denoising: A Selective Fusion Approach and A Real-World Benchmark Dataset

Authors: Rongjian Xu, Zhilu Zhang, Renlong Wu, Wangmeng Zuo

Abstract: Despite the significant progress in image denoising, it is still challenging to restore fine-scale details while removing noise, especially in extremely low-light environments. Leveraging near-infrared (NIR) images to assist visible RGB image denoising shows the potential to address this issue, becoming a promising technology. Nonetheless, existing works still struggle with taking advantage of NIR… ▽ More Despite the significant progress in image denoising, it is still challenging to restore fine-scale details while removing noise, especially in extremely low-light environments. Leveraging near-infrared (NIR) images to assist visible RGB image denoising shows the potential to address this issue, becoming a promising technology. Nonetheless, existing works still struggle with taking advantage of NIR information effectively for real-world image denoising, due to the content inconsistency between NIR-RGB images and the scarcity of real-world paired datasets. To alleviate the problem, we propose an efficient Selective Fusion Module (SFM), which can be plug-and-played into the advanced denoising networks to merge the deep NIR-RGB features. Specifically, we sequentially perform the global and local modulation for NIR and RGB features, and then integrate the two modulated features. Furthermore, we present a Real-world NIR-Assisted Image Denoising (Real-NAID) dataset, which covers diverse scenarios as well as various noise levels. Extensive experiments on both synthetic and our real-world datasets demonstrate that the proposed method achieves better results than state-of-the-art ones. The dataset, codes, and pre-trained models will be publicly available at https://github.com/ronjonxu/NAID. △ Less

Submitted 18 April, 2024; v1 submitted 12 April, 2024; originally announced April 2024.

Comments: 10 pages

arXiv:2404.07846 [pdf, other]

TBSN: Transformer-Based Blind-Spot Network for Self-Supervised Image Denoising

Authors: Junyi Li, Zhilu Zhang, Wangmeng Zuo

Abstract: Blind-spot networks (BSN) have been prevalent network architectures in self-supervised image denoising (SSID). Existing BSNs are mostly conducted with convolution layers. Although transformers offer potential solutions to the limitations of convolutions and have demonstrated success in various image restoration tasks, their attention mechanisms may violate the blind-spot requirement, thus restrict… ▽ More Blind-spot networks (BSN) have been prevalent network architectures in self-supervised image denoising (SSID). Existing BSNs are mostly conducted with convolution layers. Although transformers offer potential solutions to the limitations of convolutions and have demonstrated success in various image restoration tasks, their attention mechanisms may violate the blind-spot requirement, thus restricting their applicability in SSID. In this paper, we present a transformer-based blind-spot network (TBSN) by analyzing and redesigning the transformer operators that meet the blind-spot requirement. Specifically, TBSN follows the architectural principles of dilated BSNs, and incorporates spatial as well as channel self-attention layers to enhance the network capability. For spatial self-attention, an elaborate mask is applied to the attention matrix to restrict its receptive field, thus mimicking the dilated convolution. For channel self-attention, we observe that it may leak the blind-spot information when the channel number is greater than spatial size in the deep layers of multi-scale architectures. To eliminate this effect, we divide the channel into several groups and perform channel attention separately. Furthermore, we introduce a knowledge distillation strategy that distills TBSN into smaller denoisers to improve computational efficiency while maintaining performance. Extensive experiments on real-world image denoising datasets show that TBSN largely extends the receptive field and exhibits favorable performance against state-of-the-art SSID methods. The code and pre-trained models will be publicly available at https://github.com/nagejacob/TBSN. △ Less

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.06451 [pdf, other]

SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

Authors: Xiaoyu Liu, Yuxiang Wei, Ming Liu, Xianhui Lin, Peiran Ren, Xuansong Xie, Wangmeng Zuo

Abstract: Human visual imagination usually begins with analogies or rough sketches. For example, given an image with a girl playing guitar before a building, one may analogously imagine how it seems like if Iron Man playing guitar before Pyramid in Egypt. Nonetheless, visual condition may not be precisely aligned with the imaginary result indicated by text prompt, and existing layout-controllable text-to-im… ▽ More Human visual imagination usually begins with analogies or rough sketches. For example, given an image with a girl playing guitar before a building, one may analogously imagine how it seems like if Iron Man playing guitar before Pyramid in Egypt. Nonetheless, visual condition may not be precisely aligned with the imaginary result indicated by text prompt, and existing layout-controllable text-to-image (T2I) generation models is prone to producing degraded generated results with obvious artifacts. To address this issue, we present a novel T2I generation method dubbed SmartControl, which is designed to modify the rough visual conditions for adapting to text prompt. The key idea of our SmartControl is to relax the visual condition on the areas that are conflicted with text prompts. In specific, a Control Scale Predictor (CSP) is designed to identify the conflict regions and predict the local control scales, while a dataset with text prompts and rough visual conditions is constructed for training CSP. It is worth noting that, even with a limited number (e.g., 1,000~2,000) of training samples, our SmartControl can generalize well to unseen objects. Extensive experiments on four typical visual condition types clearly show the efficacy of our SmartControl against state-of-the-arts. Source code, pre-trained models, and datasets are available at https://github.com/liuxiaoyu1104/SmartControl. △ Less

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.05580 [pdf, other]

Responsible Visual Editing

Authors: Minheng Ni, Yeli Shen, Lei Zhang, Wangmeng Zuo

Abstract: With recent advancements in visual synthesis, there is a growing risk of encountering images with detrimental effects, such as hate, discrimination, or privacy violations. The research on transforming harmful images into responsible ones remains unexplored. In this paper, we formulate a new task, responsible visual editing, which entails modifying specific concepts within an image to render it mor… ▽ More With recent advancements in visual synthesis, there is a growing risk of encountering images with detrimental effects, such as hate, discrimination, or privacy violations. The research on transforming harmful images into responsible ones remains unexplored. In this paper, we formulate a new task, responsible visual editing, which entails modifying specific concepts within an image to render it more responsible while minimizing changes. However, the concept that needs to be edited is often abstract, making it challenging to locate what needs to be modified and plan how to modify it. To tackle these challenges, we propose a Cognitive Editor (CoEditor) that harnesses the large multimodal model through a two-stage cognitive process: (1) a perceptual cognitive process to focus on what needs to be modified and (2) a behavioral cognitive process to strategize how to modify. To mitigate the negative implications of harmful images on research, we create a transparent and public dataset, AltBear, which expresses harmful information using teddy bears instead of humans. Experiments demonstrate that CoEditor can effectively comprehend abstract concepts within complex scenes and significantly surpass the performance of baseline models for responsible visual editing. We find that the AltBear dataset corresponds well to the harmful content found in real images, offering a consistent experimental evaluation, thereby providing a safer benchmark for future research. Moreover, CoEditor also shows great results in general editing. We release our code and dataset at https://github.com/kodenii/Responsible-Visual-Editing. △ Less

Submitted 8 April, 2024; originally announced April 2024.

Comments: 24 pages, 12 figures

arXiv:2404.05268 [pdf, other]

MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation

Authors: Jiaxiu Jiang, Yabo Zhang, Kailai Feng, Xiaohe Wu, Wangmeng Zuo

Abstract: Customized text-to-image generation aims to synthesize instantiations of user-specified concepts and has achieved unprecedented progress in handling individual concept. However, when extending to multiple customized concepts, existing methods exhibit limitations in terms of flexibility and fidelity, only accommodating the combination of limited types of models and potentially resulting in a mix of… ▽ More Customized text-to-image generation aims to synthesize instantiations of user-specified concepts and has achieved unprecedented progress in handling individual concept. However, when extending to multiple customized concepts, existing methods exhibit limitations in terms of flexibility and fidelity, only accommodating the combination of limited types of models and potentially resulting in a mix of characteristics from different concepts. In this paper, we introduce the Multi-concept guidance for Multi-concept customization, termed MC$^2$, for improved flexibility and fidelity. MC$^2$ decouples the requirements for model architecture via inference time optimization, allowing the integration of various heterogeneous single-concept customized models. It adaptively refines the attention weights between visual and textual tokens, directing image regions to focus on their associated words while diminishing the impact of irrelevant ones. Extensive experiments demonstrate that MC$^2$ even surpasses previous methods that require additional training in terms of consistency with input prompt and reference images. Moreover, MC$^2$ can be extended to elevate the compositional capabilities of text-to-image generation, yielding appealing results. Code will be publicly available at https://github.com/JIANGJiaXiu/MC-2. △ Less

Submitted 12 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.04908 [pdf, other]

Dual-Camera Smooth Zoom on Mobile Phones

Authors: Renlong Wu, Zhilu Zhang, Yu Yang, Wangmeng Zuo

Abstract: When zooming between dual cameras on a mobile, noticeable jumps in geometric content and image color occur in the preview, inevitably affecting the user's zoom experience. In this work, we introduce a new task, ie, dual-camera smooth zoom (DCSZ) to achieve a smooth zoom preview. The frame interpolation (FI) technique is a potential solution but struggles with ground-truth collection. To address th… ▽ More When zooming between dual cameras on a mobile, noticeable jumps in geometric content and image color occur in the preview, inevitably affecting the user's zoom experience. In this work, we introduce a new task, ie, dual-camera smooth zoom (DCSZ) to achieve a smooth zoom preview. The frame interpolation (FI) technique is a potential solution but struggles with ground-truth collection. To address the issue, we suggest a data factory solution where continuous virtual cameras are assembled to generate DCSZ data by rendering reconstructed 3D models of the scene. In particular, we propose a novel dual-camera smooth zoom Gaussian Splatting (ZoomGS), where a camera-specific encoding is introduced to construct a specific 3D model for each virtual camera. With the proposed data factory, we construct a synthetic dataset for DCSZ, and we utilize it to fine-tune FI models. In addition, we collect real-world dual-zoom images without ground-truth for evaluation. Extensive experiments are conducted with multiple FI methods. The results show that the fine-tuned FI models achieve a significant performance improvement over the original ones on DCSZ task. The datasets, codes, and pre-trained models will be publicly available. △ Less

Submitted 7 April, 2024; originally announced April 2024.

Comments: 24

arXiv:2404.04833 [pdf, other]

ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model

Authors: Binghui Chen, Wenyu Li, Yifeng Geng, Xuansong Xie, Wangmeng Zuo

Abstract: With the development of the large-scale diffusion model, Artificial Intelligence Generated Content (AIGC) techniques are popular recently. However, how to truly make it serve our daily lives remains an open question. To this end, in this paper, we focus on employing AIGC techniques in one filed of E-commerce marketing, i.e., generating hyper-realistic advertising images for displaying user-specifi… ▽ More With the development of the large-scale diffusion model, Artificial Intelligence Generated Content (AIGC) techniques are popular recently. However, how to truly make it serve our daily lives remains an open question. To this end, in this paper, we focus on employing AIGC techniques in one filed of E-commerce marketing, i.e., generating hyper-realistic advertising images for displaying user-specified shoes by human. Specifically, we propose a shoe-wearing system, called Shoe-Model, to generate plausible images of human legs interacting with the given shoes. It consists of three modules: (1) shoe wearable-area detection module (WD), (2) leg-pose synthesis module (LpS) and the final (3) shoe-wearing image generation module (SW). Them three are performed in ordered stages. Compared to baselines, our ShoeModel is shown to generalize better to different type of shoes and has ability of keeping the ID-consistency of the given shoes, as well as automatically producing reasonable interactions with human. Extensive experiments show the effectiveness of our proposed shoe-wearing system. Figure 1 shows the input and output examples of our ShoeModel. △ Less

Submitted 7 April, 2024; originally announced April 2024.

Comments: 16 pages

arXiv:2404.04317 [pdf, other]

DeepLINK-T: deep learning inference for time series data using knockoffs and LSTM

Authors: Wenxuan Zuo, Zifan Zhu, Yuxuan Du, Yi-Chun Yeh, Jed A. Fuhrman, Jinchi Lv, Yingying Fan, Fengzhu Sun

Abstract: High-dimensional longitudinal time series data is prevalent across various real-world applications. Many such applications can be modeled as regression problems with high-dimensional time series covariates. Deep learning has been a popular and powerful tool for fitting these regression models. Yet, the development of interpretable and reproducible deep-learning models is challenging and remains un… ▽ More High-dimensional longitudinal time series data is prevalent across various real-world applications. Many such applications can be modeled as regression problems with high-dimensional time series covariates. Deep learning has been a popular and powerful tool for fitting these regression models. Yet, the development of interpretable and reproducible deep-learning models is challenging and remains underexplored. This study introduces a novel method, Deep Learning Inference using Knockoffs for Time series data (DeepLINK-T), focusing on the selection of significant time series variables in regression while controlling the false discovery rate (FDR) at a predetermined level. DeepLINK-T combines deep learning with knockoff inference to control FDR in feature selection for time series models, accommodating a wide variety of feature distributions. It addresses dependencies across time and features by leveraging a time-varying latent factor structure in time series covariates. Three key ingredients for DeepLINK-T are 1) a Long Short-Term Memory (LSTM) autoencoder for generating time series knockoff variables, 2) an LSTM prediction network using both original and knockoff variables, and 3) the application of the knockoffs framework for variable selection with FDR control. Extensive simulation studies have been conducted to evaluate DeepLINK-T's performance, showing its capability to control FDR effectively while demonstrating superior feature selection power for high-dimensional longitudinal time series data compared to its non-time series counterpart. DeepLINK-T is further applied to three metagenomic data sets, validating its practical utility and effectiveness, and underscoring its potential in real-world applications. △ Less

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2403.11192 [pdf, other]

Self-Supervised Video Desmoking for Laparoscopic Surgery

Authors: Renlong Wu, Zhilu Zhang, Shuohao Zhang, Longfei Gou, Haobin Chen, Lei Zhang, Hao Chen, Wangmeng Zuo

Abstract: Due to the difficulty of collecting real paired data, most existing desmoking methods train the models by synthesizing smoke, generalizing poorly to real surgical scenarios. Although a few works have explored single-image real-world desmoking in unpaired learning manners, they still encounter challenges in handling dense smoke. In this work, we address these issues together by introducing the self… ▽ More Due to the difficulty of collecting real paired data, most existing desmoking methods train the models by synthesizing smoke, generalizing poorly to real surgical scenarios. Although a few works have explored single-image real-world desmoking in unpaired learning manners, they still encounter challenges in handling dense smoke. In this work, we address these issues together by introducing the self-supervised surgery video desmoking (SelfSVD). On the one hand, we observe that the frame captured before the activation of high-energy devices is generally clear (named pre-smoke frame, PS frame), thus it can serve as supervision for other smoky frames, making real-world self-supervised video desmoking practically feasible. On the other hand, in order to enhance the desmoking performance, we further feed the valuable information from PS frame into models, where a masking strategy and a regularization term are presented to avoid trivial solutions. In addition, we construct a real surgery video dataset for desmoking, which covers a variety of smoky scenes. Extensive experiments on the dataset show that our SelfSVD can remove smoke more effectively and efficiently while recovering more photo-realistic details than the state-of-the-art methods. The dataset, codes, and pre-trained models are available at \url{https://github.com/ZcsrenlongZ/SelfSVD}. △ Less

Submitted 17 March, 2024; originally announced March 2024.

Comments: 28 pages

arXiv:2403.07290 [pdf, other]

Learning Hierarchical Color Guidance for Depth Map Super-Resolution

Authors: Runmin Cong, Ronghui Sheng, Hao Wu, Yulan Guo, Yunchao Wei, Wangmeng Zuo, Yao Zhao, Sam Kwong

Abstract: Color information is the most commonly used prior knowledge for depth map super-resolution (DSR), which can provide high-frequency boundary guidance for detail restoration. However, its role and functionality in DSR have not been fully developed. In this paper, we rethink the utilization of color information and propose a hierarchical color guidance network to achieve DSR. On the one hand, the low… ▽ More Color information is the most commonly used prior knowledge for depth map super-resolution (DSR), which can provide high-frequency boundary guidance for detail restoration. However, its role and functionality in DSR have not been fully developed. In this paper, we rethink the utilization of color information and propose a hierarchical color guidance network to achieve DSR. On the one hand, the low-level detail embedding module is designed to supplement high-frequency color information of depth features in a residual mask manner at the low-level stages. On the other hand, the high-level abstract guidance module is proposed to maintain semantic consistency in the reconstruction process by using a semantic mask that encodes the global guidance information. The color information of these two dimensions plays a role in the front and back ends of the attention-based feature projection (AFP) module in a more comprehensive form. Simultaneously, the AFP module integrates the multi-scale content enhancement block and adaptive attention projection block to make full use of multi-scale information and adaptively project critical restoration information in an attention manner for DSR. Compared with the state-of-the-art methods on four benchmark datasets, our method achieves more competitive performance both qualitatively and quantitatively. △ Less

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2403.05807 [pdf, other]

A self-supervised CNN for image watermark removal

Authors: Chunwei Tian, Menghua Zheng, Tiancai Jiao, Wangmeng Zuo, Yanning Zhang, Chia-Wen Lin

Abstract: Popular convolutional neural networks mainly use paired images in a supervised way for image watermark removal. However, watermarked images do not have reference images in the real world, which results in poor robustness of image watermark removal techniques. In this paper, we propose a self-supervised convolutional neural network (CNN) in image watermark removal (SWCNN). SWCNN uses a self-supervi… ▽ More Popular convolutional neural networks mainly use paired images in a supervised way for image watermark removal. However, watermarked images do not have reference images in the real world, which results in poor robustness of image watermark removal techniques. In this paper, we propose a self-supervised convolutional neural network (CNN) in image watermark removal (SWCNN). SWCNN uses a self-supervised way to construct reference watermarked images rather than given paired training samples, according to watermark distribution. A heterogeneous U-Net architecture is used to extract more complementary structural information via simple components for image watermark removal. Taking into account texture information, a mixed loss is exploited to improve visual effects of image watermark removal. Besides, a watermark dataset is conducted. Experimental results show that the proposed SWCNN is superior to popular CNNs in image watermark removal. △ Less

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2403.05438 [pdf, other]

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Authors: Yabo Zhang, Yuxiang Wei, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong Xie, Xiangyang Ji, Wangmeng Zuo

Abstract: Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. On the contrary, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates… ▽ More Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. On the contrary, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I. Different from conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inverting to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict less noisy latent, adding more photo-realistic details. We have conducted experiments in extensive prompts under the combination of various T2V and T2I. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code is available at https://github.com/YBYBZhang/VideoElevator. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: Project page: https://videoelevator.github.io Code: https://github.com/YBYBZhang/VideoElevator

arXiv:2403.05428 [pdf, other]

Towards Real-World Stickers Use: A New Dataset for Multi-Tag Sticker Recognition

Authors: Bingbing Wang, Bin Liang, Chun-Mei Feng, Wangmeng Zuo, Zhixin Bai, Shijue Huang, Kam-Fai Wong, Xi Zeng, Ruifeng Xu

Abstract: In real-world conversations, the diversity and ambiguity of stickers often lead to varied interpretations based on the context, necessitating the requirement for comprehensively understanding stickers and supporting multi-tagging. To address this challenge, we introduce StickerTAG, the first multi-tag sticker dataset comprising a collected tag set with 461 tags and 13,571 sticker-tag pairs, design… ▽ More In real-world conversations, the diversity and ambiguity of stickers often lead to varied interpretations based on the context, necessitating the requirement for comprehensively understanding stickers and supporting multi-tagging. To address this challenge, we introduce StickerTAG, the first multi-tag sticker dataset comprising a collected tag set with 461 tags and 13,571 sticker-tag pairs, designed to provide a deeper understanding of stickers. Recognizing multiple tags for stickers becomes particularly challenging due to sticker tags usually are fine-grained attribute aware. Hence, we propose an Attentive Attribute-oriented Prompt Learning method, ie, Att$^2$PL, to capture informative features of stickers in a fine-grained manner to better differentiate tags. Specifically, we first apply an Attribute-oriented Description Generation (ADG) module to obtain the description for stickers from four attributes. Then, a Local Re-attention (LoR) module is designed to perceive the importance of local information. Finally, we use prompt learning to guide the recognition process and adopt confidence penalty optimization to penalize the confident output distribution. Extensive experiments show that our method achieves encouraging results for all commonly used metrics. △ Less

Submitted 16 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.01852 [pdf, other]

PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis

Authors: Zhengyao Lv, Yuxiang Wei, Wangmeng Zuo, Kwan-Yee K. Wong

Abstract: Recent advancements in large-scale pre-trained text-to-image models have led to remarkable progress in semantic image synthesis. Nevertheless, synthesizing high-quality images with consistent semantics and layout remains a challenge. In this paper, we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues. Specifically, w… ▽ More Recent advancements in large-scale pre-trained text-to-image models have led to remarkable progress in semantic image synthesis. Nevertheless, synthesizing high-quality images with consistent semantics and layout remains a challenge. In this paper, we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues. Specifically, we first employ the layout control map to faithfully represent layouts in the feature space. Subsequently, we combine the layout and semantic features in a timestep-adaptive manner to synthesize images with realistic details. During fine-tuning, we propose the Semantic Alignment (SA) loss to further enhance layout alignment. Additionally, we introduce the Layout-Free Prior Preservation (LFP) loss, which leverages unlabeled data to maintain the priors of pre-trained models, thereby improving the visual quality and semantic consistency of synthesized images. Extensive experiments demonstrate that our approach performs favorably in terms of visual quality, semantic consistency, and layout alignment. The source code and model are available at https://github.com/cszy98/PLACE/tree/main. △ Less

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.16674 [pdf, other]

ConSept: Continual Semantic Segmentation via Adapter-based Vision Transformer

Authors: Bowen Dong, Guanglei Yang, Wangmeng Zuo, Lei Zhang

Abstract: In this paper, we delve into the realm of vision transformers for continual semantic segmentation, a problem that has not been sufficiently explored in previous literature. Empirical investigations on the adaptation of existing frameworks to vanilla ViT reveal that incorporating visual adapters into ViTs or fine-tuning ViTs with distillation terms is advantageous for enhancing the segmentation cap… ▽ More In this paper, we delve into the realm of vision transformers for continual semantic segmentation, a problem that has not been sufficiently explored in previous literature. Empirical investigations on the adaptation of existing frameworks to vanilla ViT reveal that incorporating visual adapters into ViTs or fine-tuning ViTs with distillation terms is advantageous for enhancing the segmentation capability of novel classes. These findings motivate us to propose Continual semantic Segmentation via Adapter-based ViT, namely ConSept. Within the simplified architecture of ViT with linear segmentation head, ConSept integrates lightweight attention-based adapters into vanilla ViTs. Capitalizing on the feature adaptation abilities of these adapters, ConSept not only retains superior segmentation ability for old classes, but also attains promising segmentation quality for novel classes. To further harness the intrinsic anti-catastrophic forgetting ability of ConSept and concurrently enhance the segmentation capabilities for both old and new classes, we propose two key strategies: distillation with a deterministic old-classes boundary for improved anti-catastrophic forgetting, and dual dice losses to regularize segmentation maps, thereby improving overall segmentation performance. Extensive experiments show the effectiveness of ConSept on multiple continual semantic segmentation benchmarks under overlapped or disjoint settings. Code will be publicly available at \url{https://github.com/DongSky/ConSept}. △ Less

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.15704 [pdf, other]

A Heterogeneous Dynamic Convolutional Neural Network for Image Super-resolution

Authors: Chunwei Tian, Xuanyu Zhang, Jia Ren, Wangmeng Zuo, Yanning Zhang, Chia-Wen Lin

Abstract: Convolutional neural networks can automatically learn features via deep network architectures and given input samples. However, robustness of obtained models may have challenges in varying scenes. Bigger differences of a network architecture are beneficial to extract more complementary structural information to enhance robustness of an obtained super-resolution model. In this paper, we present a h… ▽ More Convolutional neural networks can automatically learn features via deep network architectures and given input samples. However, robustness of obtained models may have challenges in varying scenes. Bigger differences of a network architecture are beneficial to extract more complementary structural information to enhance robustness of an obtained super-resolution model. In this paper, we present a heterogeneous dynamic convolutional network in image super-resolution (HDSRNet). To capture more information, HDSRNet is implemented by a heterogeneous parallel network. The upper network can facilitate more contexture information via stacked heterogeneous blocks to improve effects of image super-resolution. Each heterogeneous block is composed of a combination of a dilated, dynamic, common convolutional layers, ReLU and residual learning operation. It can not only adaptively adjust parameters, according to different inputs, but also prevent long-term dependency problem. The lower network utilizes a symmetric architecture to enhance relations of different layers to mine more structural information, which is complementary with a upper network for image super-resolution. The relevant experimental results show that the proposed HDSRNet is effective to deal with image resolving. The code of HDSRNet can be obtained at https://github.com/hellloxiaotian/HDSRNet. △ Less

Submitted 23 February, 2024; originally announced February 2024.

Comments: 11pages, 7 figures

arXiv:2402.05044 [pdf, other]

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

Authors: Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao

Abstract: In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose \emph{SALAD-Bench}, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy s… ▽ More In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose \emph{SALAD-Bench}, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.SALAD-Bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications and multiple-choice. To effectively manage the inherent complexity, we introduce an innovative evaluators: the LLM-based MD-Judge for QA pairs with a particular focus on attack-enhanced queries, ensuring a seamless, and reliable evaluation. Above components extend SALAD-Bench from standard LLM safety evaluation to both LLM attack and defense methods evaluation, ensuring the joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and evaluator are released under https://github.com/OpenSafetyLab/SALAD-BENCH. △ Less

Submitted 7 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: Accepted at ACL 2024 Findings

arXiv:2402.01166 [pdf, other]

A Comprehensive Survey on 3D Content Generation

Authors: Jian Liu, Xiaoshui Huang, Tianyu Huang, Lu Chen, Yuenan Hou, Shixiang Tang, Ziwei Liu, Wanli Ouyang, Wangmeng Zuo, Junjun Jiang, Xianming Liu

Abstract: Recent years have witnessed remarkable advances in artificial intelligence generated content(AIGC), with diverse input modalities, e.g., text, image, video, audio and 3D. The 3D is the most close visual modality to real-world 3D environment and carries enormous knowledge. The 3D content generation shows both academic and practical values while also presenting formidable technical challenges. This… ▽ More Recent years have witnessed remarkable advances in artificial intelligence generated content(AIGC), with diverse input modalities, e.g., text, image, video, audio and 3D. The 3D is the most close visual modality to real-world 3D environment and carries enormous knowledge. The 3D content generation shows both academic and practical values while also presenting formidable technical challenges. This review aims to consolidate developments within the burgeoning domain of 3D content generation. Specifically, a new taxonomy is proposed that categorizes existing approaches into three types: 3D native generative methods, 2D prior-based 3D generative methods, and hybrid 3D generative methods. The survey covers approximately 60 papers spanning the major techniques. Besides, we discuss limitations of current 3D content generation techniques, and point out open challenges as well as promising directions for future work. Accompanied with this survey, we have established a project website where the resources on 3D content generation research are provided. The project page is available at https://github.com/hitcslj/Awesome-AIGC-3D. △ Less

Submitted 19 March, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: under review

arXiv:2401.17138 [pdf, other]

Nuclear scattering via quantum computing

Authors: Peiyan Wang, Weijie Du, Wei Zuo, James P. Vary

Abstract: We propose a hybrid quantum-classical framework to solve the elastic scattering phase shift of two well-bound nuclei in an uncoupled channel. Within this framework, we develop a many-body formalism in which the continuum scattering states of the two colliding nuclei are regulated by a weak external harmonic oscillator potential with varying strength. Based on our formalism, we propose an approach… ▽ More We propose a hybrid quantum-classical framework to solve the elastic scattering phase shift of two well-bound nuclei in an uncoupled channel. Within this framework, we develop a many-body formalism in which the continuum scattering states of the two colliding nuclei are regulated by a weak external harmonic oscillator potential with varying strength. Based on our formalism, we propose an approach to compute the eigenenergies of the low-lying scattering states of the relative motion of the colliding nuclei as a function of the oscillator strength of the confining potential. Utilizing the modified effective range expansion, we extrapolate the elastic scattering phase shift of the colliding nuclei from these eigenenergies to the limit when the external potential vanishes. In our hybrid approach, we leverage the advantage of quantum computing to solve for these eigenenergies from a set of many-nucleon Hamiltonian eigenvalue problems. These eigenenergies are inputs to classical computers to obtain the phase shift. We demonstrate our framework with two simple problems, where we implement the rodeo algorithm to solve the relevant eigenenergies with the IBM Qiskit quantum simulator. The results of both the spectra and the elastic scattering phase shifts agree well with other theoretical results. △ Less

Submitted 15 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: We welcome comments!

arXiv:2401.01598 [pdf, other]

Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

Authors: Zitong Huang, Ze Chen, Zhixing Chen, Erjin Zhou, Xinxing Xu, Rick Siow Mong Goh, Yong Liu, Wangmeng Zuo, Chunmei Feng

Abstract: Few-shot Class-Incremental Learning (FSCIL) aims to continuously learn new classes based on very limited training data without forgetting the old ones encountered. Existing studies solely relied on pure visual networks, while in this paper we solved FSCIL by leveraging the Vision-Language model (e.g., CLIP) and propose a simple yet effective framework, named Learning Prompt with Distribution-based… ▽ More Few-shot Class-Incremental Learning (FSCIL) aims to continuously learn new classes based on very limited training data without forgetting the old ones encountered. Existing studies solely relied on pure visual networks, while in this paper we solved FSCIL by leveraging the Vision-Language model (e.g., CLIP) and propose a simple yet effective framework, named Learning Prompt with Distribution-based Feature Replay (LP-DiF). We observe that simply using CLIP for zero-shot evaluation can substantially outperform the most influential methods. Then, prompt tuning technique is involved to further improve its adaptation ability, allowing the model to continually capture specific knowledge from each session. To prevent the learnable prompt from forgetting old knowledge in the new session, we propose a pseudo-feature replay approach. Specifically, we preserve the old knowledge of each class by maintaining a feature-level Gaussian distribution with a diagonal covariance matrix, which is estimated by the image features of training images and synthesized features generated from a VAE. When progressing to a new session, pseudo-features are sampled from old-class distributions combined with training images of the current session to optimize the prompt, thus enabling the model to learn new knowledge while retaining old knowledge. Experiments on three prevalent benchmarks, i.e., CIFAR100, mini-ImageNet, CUB-200, and two more challenging benchmarks, i.e., SUN-397 and CUB-200$^*$ proposed in this paper showcase the superiority of LP-DiF, achieving new state-of-the-art (SOTA) in FSCIL. Code is publicly available at https://github.com/1170300714/LP-DiF. △ Less

Submitted 5 April, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.00766 [pdf, other]

Exposure Bracketing is All You Need for Unifying Image Restoration and Enhancement Tasks

Authors: Zhilu Zhang, Shuohao Zhang, Renlong Wu, Zifei Yan, Wangmeng Zuo

Abstract: It is highly desired but challenging to acquire high-quality photos with clear content in low-light environments. Although multi-image processing methods (using burst, dual-exposure, or multi-exposure images) have made significant progress in addressing this issue, they typically focus on specific restoration or enhancement problems, and do not fully explore the potential of utilizing multiple ima… ▽ More It is highly desired but challenging to acquire high-quality photos with clear content in low-light environments. Although multi-image processing methods (using burst, dual-exposure, or multi-exposure images) have made significant progress in addressing this issue, they typically focus on specific restoration or enhancement problems, and do not fully explore the potential of utilizing multiple images. Motivated by the fact that multi-exposure images are complementary in denoising, deblurring, high dynamic range imaging, and super-resolution, we propose to utilize exposure bracketing photography to unify image restoration and enhancement tasks in this work. Due to the difficulty in collecting real-world pairs, we suggest a solution that first pre-trains the model with synthetic paired data and then adapts it to real-world unlabeled images. In particular, a temporally modulated recurrent network (TMRNet) and self-supervised adaptation method are proposed. Moreover, we construct a data simulation pipeline to synthesize pairs and collect real-world images from 200 nighttime scenarios. Experiments on both datasets show that our method performs favorably against the state-of-the-art multi-image processing ones. The dataset, code, and pre-trained models are available at https://github.com/cszhilu1998/BracketIRE. △ Less

Submitted 31 May, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

Comments: 21 pages

arXiv:2312.17334 [pdf, other]

Improving Image Restoration through Removing Degradations in Textual Representations

Authors: Jingbo Lin, Zhilu Zhang, Yuxiang Wei, Dongwei Ren, Dongsheng Jiang, Wangmeng Zuo

Abstract: In this paper, we introduce a new perspective for improving image restoration by removing degradation in the textual representations of a given degraded image. Intuitively, restoration is much easier on text modality than image one. For example, it can be easily conducted by removing degradation-related words while keeping the content-aware words. Hence, we combine the advantages of images in deta… ▽ More In this paper, we introduce a new perspective for improving image restoration by removing degradation in the textual representations of a given degraded image. Intuitively, restoration is much easier on text modality than image one. For example, it can be easily conducted by removing degradation-related words while keeping the content-aware words. Hence, we combine the advantages of images in detail description and ones of text in degradation removal to perform restoration. To address the cross-modal assistance, we propose to map the degraded images into textual representations for removing the degradations, and then convert the restored textual representations into a guidance image for assisting image restoration. In particular, We ingeniously embed an image-to-text mapper and text restoration module into CLIP-equipped text-to-image models to generate the guidance. Then, we adopt a simple coarse-to-fine approach to dynamically inject multi-scale information from guidance to image restoration networks. Extensive experiments are conducted on various image restoration tasks, including deblurring, dehazing, deraining, and denoising, and all-in-one image restoration. The results showcase that our method outperforms state-of-the-art ones across all these tasks. The codes and models are available at \url{https://github.com/mrluin/TextualDegRemoval}. △ Less

Submitted 28 December, 2023; originally announced December 2023.

arXiv:2312.17051 [pdf, other]

FILP-3D: Enhancing 3D Few-shot Class-incremental Learning with Pre-trained Vision-Language Models

Authors: Wan Xu, Tianyu Huang, Tianyu Qu, Guanglei Yang, Yiwen Guo, Wangmeng Zuo

Abstract: Few-shot class-incremental learning (FSCIL) aims to mitigate the catastrophic forgetting issue when a model is incrementally trained on limited data. While the Contrastive Vision-Language Pre-Training (CLIP) model has been effective in addressing 2D few/zero-shot learning tasks, its direct application to 3D FSCIL faces limitations. These limitations arise from feature space misalignment and signif… ▽ More Few-shot class-incremental learning (FSCIL) aims to mitigate the catastrophic forgetting issue when a model is incrementally trained on limited data. While the Contrastive Vision-Language Pre-Training (CLIP) model has been effective in addressing 2D few/zero-shot learning tasks, its direct application to 3D FSCIL faces limitations. These limitations arise from feature space misalignment and significant noise in real-world scanned 3D data. To address these challenges, we introduce two novel components: the Redundant Feature Eliminator (RFE) and the Spatial Noise Compensator (SNC). RFE aligns the feature spaces of input point clouds and their embeddings by performing a unique dimensionality reduction on the feature space of pre-trained models (PTMs), effectively eliminating redundant information without compromising semantic integrity. On the other hand, SNC is a graph-based 3D model designed to capture robust geometric information within point clouds, thereby augmenting the knowledge lost due to projection, particularly when processing real-world scanned data. Considering the imbalance in existing 3D datasets, we also propose new evaluation metrics that offer a more nuanced assessment of a 3D FSCIL model. Traditional accuracy metrics are proved to be biased; thus, our metrics focus on the model's proficiency in learning new classes while maintaining the balance between old and new classes. Experimental results on both established 3D FSCIL benchmarks and our dataset demonstrate that our approach significantly outperforms existing state-of-the-art methods. △ Less

Submitted 28 December, 2023; originally announced December 2023.

Showing 1–50 of 422 results for author: Zuo, W