Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

Authors: Jiasheng Zheng, Boxi Cao, Zhengzhao Ma, Ruotong Pan, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun

Abstract: In recent years, researchers have proposed numerous benchmarks to evaluate the impressive coding capabilities of large language models (LLMs). However, existing benchmarks primarily focus on assessing the correctness of code generated by LLMs, while neglecting other critical dimensions that also significantly impact code quality. Therefore, this paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by LLMs across 4 dimensions: Readability, mAintainability, Correctness, and Efficiency. Specifically, considering the demand-dependent nature of dimensions beyond correctness, we design various types of user requirements for each dimension to assess the model's ability to generate correct code that also meets user demands. We evaluate 18 representative LLMs on RACE and find that: 1) the current LLMs' ability to generate high-quality code on demand does not yet meet the requirements of software development; 2) readability serves as a critical indicator of the overall quality of generated code; 3) most LLMs exhibit an inherent preference for a specific coding style. These findings can help researchers gain a deeper understanding of the coding capabilities of current LLMs and shed light on future directions for model improvement.

Submitted 16 July, 2024; originally announced July 2024.

Comments: We release benchmark at https://github.com/jszheng21/RACE and leaderboard at https://huggingface.co/spaces/jszheng/RACE_leaderboard

arXiv:2407.11341 [pdf, other]

SN 2021dbg: A Luminous Type IIP-IIL Supernova Exploding from a Massive Star with a Layered Shell

Authors: Zeyi Zhao, Jujia Zhang, Liping Li, Qian Zhai, Yongzhi Cai, Shubham Srivastav, Xiaofeng Wang, Han Lin, Yi Yang, Alexei V. Filippenko, Thomas G. Brink, WeiKang Zheng

Abstract: We present extensive observations and analysis of supernova (SN) 2021dbg, utilizing optical photometry and spectroscopy. For approximately 385 days following the explosion, SN 2021dbg exhibited remarkable luminosity, surpassing most SNe II. This initial high luminosity is potentially attributed to the interaction between the ejected material and the surrounding circumstellar material (CSM), as evidenced by the pronounced interaction signatures observed in its spectra. The subsequent high luminosity is primarily due to the significant $^{56}$Ni ($0.17 \pm 0.05$ M$_{\odot}$) produced in the explosion. Based on the flux of flash emission lines detected in the initial spectra, we estimate that the CSM mass near the progenitor amounted to $\sim$(1.0--2.0) $\times 10^{-3}$ M$_{\odot}$, likely resulting from intense stellar wind activity 2--3 yr preceding the explosion. Considering the bolometric light curve, nebular spectrum modeling, and mass-loss rate, we suggest that the progenitor of SN 2021dbg was a red supergiant (RSG) with a mass of $\sim 20$ M$_{\odot}$ and a radius of 1200 R$_{\odot}$. This RSG featured a thick hydrogen shell, which may have contained a region with a sharp decrease in material density, electron density, and temperature, contributing to its layered structure. This object demonstrates mixed features of SNe IIP and SNe IIL, making it a transitional event linking the above two subclasses of SNe II.

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.10967 [pdf, other]

BECAUSE: Bilinear Causal Representation for Generalizable Offline Model-based Reinforcement Learning

Authors: Haohong Lin, Wenhao Ding, Jian Chen, Laixi Shi, Jiacheng Zhu, Bo Li, Ding Zhao

Abstract: Offline model-based reinforcement learning (MBRL) enhances data efficiency by utilizing pre-collected datasets to learn models and policies, especially in scenarios where exploration is costly or infeasible. Nevertheless, its performance often suffers from the objective mismatch between model and policy learning, resulting in inferior performance despite accurate model predictions. This paper first identifies that the primary source of this mismatch is the underlying confounders present in offline data for MBRL. Subsequently, we introduce BilinEar CAUSal rEpresentation (BECAUSE), an algorithm to capture causal representation for both states and actions to reduce the influence of the distribution shift, thus mitigating the objective mismatch problem. Comprehensive evaluations on 18 tasks that vary in data quality and environment context demonstrate the superior performance of BECAUSE over existing offline RL algorithms. We show the generalizability and robustness of BECAUSE under fewer samples or larger numbers of confounders. Additionally, we offer theoretical analysis of BECAUSE to prove its error bound and sample efficiency when integrating causal representation into offline MBRL.

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.10671 [pdf, other]

Qwen2 Technical Report

Authors: An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, et al. (34 additional authors not shown)

Abstract: This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.

Submitted 16 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

Comments: 25 pages, 1 figure

arXiv:2407.10398 [pdf, ps, other]

Proof of Lew's conjecture on the spectral gap of simplicial complex

Authors: Xiongfeng Zhan, Xueyi Huang, Huiqiu Lin

Abstract: Let $X$ be a simplicial complex on vertex set $V$ of size $n$. Let $X(k)$ denote the set of all $k$-dimensional simplices of $X$, and $\mathrm{deg}_X(σ)=|\{η\in X(k+1):σ\subseteq η\}|$ denote the degree of $σ\in X$. A missing face in $X$ is a subset $σ$ of $V$ such that $σ\notin X$ but $τ\in X$ for any proper subset $τ$ of $σ$. Let $d$ denote the maximal dimension of a missing face of $X$, and $μ_k(X)$ denote the $k$-th spectral gap of $X$, i.e., the smallest eigenvalue of the reduced $k$-dimensional Laplacian of $X$. In [J. Combin. Theory Ser. A 169 (2020) 105127], Lew established a lower bound for $μ_k(X)$: $$μ_k(X)\geq (d+1)\left(\min_{σ\in X(k)}\mathrm{deg}_X(σ)+k+1\right)-dn\geq (d+1)(k+1)-dn,$$ and further conjectured that if $μ_k(X)=(d+1)(k+1)-dn$ for some $k$, then $X\cong (Δ_d^{(d-1)})^{*(n-k-1)}*Δ_{(d+1)(k+1)-dn-1}$. In this paper, we confirm Lew's conjecture.

Submitted 14 July, 2024; originally announced July 2024.

Comments: 14 pages

MSC Class: 05E45

arXiv:2407.10040 [pdf, other]

Lean-STaR: Learning to Interleave Thinking and Proving

Authors: Haohan Lin, Zhiqing Sun, Yiming Yang, Sean Welleck

Abstract: Traditional language model-based theorem proving assumes that by training on a sufficient amount of formal proof data, a model will learn to prove theorems. Our key observation is that a wealth of informal information that is not present in formal proofs can be useful for learning to prove theorems. For instance, humans think through steps of a proof, but this thought process is not visible in the resulting code. We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof, thereby boosting the model's theorem-proving capabilities. Lean-STaR uses retrospective ground-truth tactics to generate synthetic thoughts for training the language model. At inference time, the trained model directly generates the thoughts prior to the prediction of the tactics in each proof step. Building on the self-taught reasoner framework, we then apply expert iteration to further fine-tune the model on the correct proofs it samples and verifies using the Lean solver. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment, significantly outperforming base models ($43.4\% \rightarrow 46.3\%$, Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.08301 [pdf, ps, other]

The first Steklov eigenvalue of planar graphs and beyond

Authors: Huiqiu Lin, Da Zhao

Abstract: The Steklov eigenvalue problem was introduced over a century ago, and its discrete form has attracted interest recently. Let $D$ and $δΩ$ be the maximum vertex degree and the set of vertices of degree one in a graph $\mathcal{G}$, respectively. Let $λ_2$ be the first (non-trivial) Steklov eigenvalue of $(\mathcal{G}, δΩ)$. In this paper, using the circle packing theorem and conformal mapping, we first show that $λ_2 \leq 8D / |δΩ|$ for planar graphs. This can be seen as a discrete analogue of Kokarev's bound [Variational aspects of Laplace eigenvalues on Riemannian surfaces, Adv. Math. (2014)], that is, $λ_2 < 8π/ |\partial Ω|$ for compact surfaces with boundary of genus $0$. Let $B$ and $L$ be the maximum block size and the diameter of a block graph $\mathcal{G}$, respectively. Secondly, we prove that $λ_2 \leq B^2 (D-1)/ |δΩ|$ and $λ_2 \leq (2L + (L-2)(B-2))/L^2$ for block graphs, which extend the results on trees by He and Hua [Upper bounds for the Steklov eigenvalues on trees, Calc. Var. Partial Differential Equations (2022)]. In the end, for trees with fixed leaf number and maximum degree, candidates that achieve the maximum first Steklov eigenvalue are given.

Submitted 12 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

Comments: 4 figures

MSC Class: 47A75; 49J40; 49R05; 05C10

arXiv:2407.08033 [pdf, other]

Studies of Cherenkov Photon Production in PbF$_2$ Crystals using Proton Beams at Fermilab

Authors: Thomas Anderson, Alberto Belloni, Grace Cummings, Sarah Eno, Nora Fischer, Liang Guan, Yuxiang Guo, Robert Hirosky, James Hirschauer, Yihui Lai, Daniel Levin, Hui-Chi Lin, Mekhala Paranjpe, Jianming Qian, Bing Zhou, Junjie Zhu, Ren-Yuan Zhu

Abstract: Future lepton colliders such as the FCC-ee, CEPC, ILC, or a muon collider will collect large data samples that allow precision physics studies with unprecedented accuracy, especially when the data is collected by innovative state-of-the-art detectors. An electromagnetic calorimeter based on scintillating crystals, designed to separately record Cherenkov and scintillation light, can achieve precision measurements of electrons and photons without sacrificing jet energy resolution, given adequate light collection efficiency and separation. This paper presents initial measurements from a program aimed at developing such a calorimeter system for future colliders. We focus on using PbF2 crystals to enhance the understanding of Cherenkov light collection, marking the first step in this endeavor.

Submitted 10 July, 2024; originally announced July 2024.

Comments: 10 pages

arXiv:2407.06001 [pdf, other]

Pseudo-triplet Guided Few-shot Composed Image Retrieval

Authors: Bohan Hou, Haoqiang Lin, Haokun Wen, Meng Liu, Xuemeng Song

Abstract: Composed Image Retrieval (CIR) is a challenging task that aims to retrieve the target image based on a multimodal query, i.e., a reference image and its corresponding modification text. While previous supervised or zero-shot learning paradigms all fail to strike a good trade-off between time-consuming annotation cost and retrieval performance, recent researchers introduced the task of few-shot CIR (FS-CIR) and proposed a textual inversion-based network built on the pretrained CLIP model to realize it. Despite its promising performance, the approach suffers from two key limitations: insufficient multimodal query composition training and indiscriminative training triplet selection. To address these two limitations, in this work, we propose a novel two-stage pseudo triplet guided few-shot CIR scheme, dubbed PTG-FSCIR. In the first stage, we employ a masked training strategy and advanced image caption generator to construct pseudo triplets from pure image data to enable the model to acquire primary knowledge related to multimodal query composition. In the second stage, based on active learning, we design a pseudo modification text-based query-target distance metric to evaluate the challenging score for each unlabeled sample. Meanwhile, we propose a robust top range-based random sampling strategy according to the 3-$σ$ rule in statistics, to sample the challenging samples for fine-tuning the pretrained model. Notably, our scheme is plug-and-play and compatible with any existing supervised CIR models. We tested our scheme across three backbones on three public datasets (i.e., FashionIQ, CIRR, and Birds-to-Words), achieving maximum improvements of 26.4%, 25.5% and 21.6% respectively, demonstrating our scheme's effectiveness.

Submitted 8 July, 2024; originally announced July 2024.

Comments: 15 pages, 5 figures

arXiv:2407.05594 [pdf, other]

SLIM: Spuriousness Mitigation with Minimal Human Annotations

Authors: Xiwei Xuan, Ziquan Deng, Hsuan-Tien Lin, Kwan-Liu Ma

Abstract: Recent studies highlight that deep learning models often learn spurious features mistakenly linked to labels, compromising their reliability in real-world scenarios where such correlations do not hold. Despite the increasing research effort, existing solutions often face two main challenges: they either demand substantial annotations of spurious attributes, or they yield less competitive outcomes with expensive training when additional annotations are absent. In this paper, we introduce SLIM, a cost-effective and performance-targeted approach to reducing spurious correlations in deep learning. Our method leverages a human-in-the-loop protocol featuring a novel attention labeling mechanism with a constructed attention representation space. SLIM significantly reduces the need for exhaustive additional labeling, requiring human input for fewer than 3% of instances. By prioritizing data quality over complicated training strategies, SLIM curates a smaller yet more feature-balanced data subset, fostering the development of spuriousness-robust models. Experimental validations across key benchmarks demonstrate that SLIM competes with or exceeds the performance of leading methods while significantly reducing costs. The SLIM framework thus presents a promising path for developing reliable models more efficiently. Our code is available at https://github.com/xiweix/SLIM.git/.

Submitted 8 July, 2024; originally announced July 2024.

Comments: This paper is accepted by ECCV 2024

arXiv:2407.03917 [pdf, other]

Timestep-Aware Correction for Quantized Diffusion Models

Authors: Yuzhe Yao, Feng Tian, Jun Chen, Haonan Lin, Guang Dai, Yong Liu, Jingdong Wang

Abstract: Diffusion models have marked a significant breakthrough in the synthesis of semantically coherent images. However, their extensive noise estimation networks and the iterative generation process limit their wider application, particularly on resource-constrained platforms like mobile devices. Existing post-training quantization (PTQ) methods have managed to compress diffusion models to low precision. Nevertheless, due to the iterative nature of diffusion models, quantization errors tend to accumulate throughout the generation process. This accumulation of error becomes particularly problematic in low-precision scenarios, leading to significant distortions in the generated images. We attribute this accumulation issue to two main causes: error propagation and exposure bias. To address these problems, we propose a timestep-aware correction method for quantized diffusion models, which dynamically corrects the quantization error. By leveraging the proposed method in low-precision diffusion models, substantial enhancement of output quality can be achieved with only negligible computation overhead. Extensive experiments underscore our method's effectiveness and generalizability. By employing the proposed correction strategy, we achieve state-of-the-art (SOTA) results on low-precision models.

Submitted 4 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.02940 [pdf, other]

Optical vortex-antivortex crystallization in free space

Authors: Haolin Lin, Yixuan Liao, Guohua Liu, Jianbin Ren, Zhen Li, Zhenqiang Chen, Boris A. Malomed, Shenhe Fu

Abstract: Stable vortex lattices are basic dynamical patterns which have been demonstrated in physical systems including superconductor physics, Bose-Einstein condensates, hydrodynamics and optics. Vortex-antivortex (VAV) ensembles can be produced, self-organizing into the respective polar lattices. However, these structures are in general highly unstable due to the strong VAV attraction. Here, we demonstrate that multiple optical VAV clusters nested in the propagating coherent field can crystallize into patterns which preserve their lattice structures over distance up to several Rayleigh lengths. To explain this phenomenon, we present a model for effective interactions between the vortices and antivortices at different lattice sites. The observed VAV crystallization is a consequence of the globally balanced VAV couplings. As the crystallization does not require the presence of nonlinearities and appears in free space, it may find applications to high-capacity optical communications and multiparticle manipulations. Our findings suggest possibilities for constructing VAV complexes through the orbit-orbit couplings, which differs from the extensively studied spin-orbit couplings.

Submitted 3 July, 2024; originally announced July 2024.

Comments: to be published in Nature Communications; 21 pages, 6 figures

arXiv:2407.02327 [pdf, other]

QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices

Authors: Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Yibo Zhu, Chuan Wu

Abstract: A number of production deep learning clusters have attempted to explore inference hardware for DNN training, at the off-peak serving hours with many inference GPUs idling. Conducting DNN training with a combination of heterogeneous training and inference GPUs, known as hybrid device training, presents considerable challenges due to disparities in compute capability and significant differences in memory capacity. We propose QSync, a training system that enables efficient synchronous data-parallel DNN training over hybrid devices by strategically exploiting quantized operators. According to each device's available resource capacity, QSync selects a quantization-minimized setting for operators in the distributed DNN training graph, minimizing model accuracy degradation but keeping the training efficiency brought by quantization. We carefully design a predictor with a bi-directional mixed-precision indicator to reflect the sensitivity of DNN layers on fixed-point and floating-point low-precision operators, a replayer with a neighborhood-aware cost mapper to accurately estimate the latency of distributed hybrid mixed-precision training, and then an allocator that efficiently synchronizes workers with minimized model accuracy degradation. QSync bridges the computational graph on PyTorch to an optimized backend for quantization kernel performance and flexible support for various GPU architectures. Extensive experiments show that QSync's predictor can accurately simulate distributed mixed-precision training with <5% error, with a consistent 0.27-1.03% accuracy improvement over the from-scratch training tasks compared to uniform precision.

Submitted 2 July, 2024; originally announced July 2024.

Comments: IPDPS 24

arXiv:2407.01008 [pdf]

Periodic domain inversion in single crystal barium titanate-on-insulator thin film

Authors: Pragati Aashna, Hong-Lin Lin, Yu Cao, Yuhui Yin, Yuan Gao, Sakthi Sanjeev Mohanraj, Di Zhu, Aaron Danner

Abstract: We report experimentally achieving the first-ever electric field periodic poling of single crystal barium titanate (BTO, or BaTiO3) thin film on insulator. Owing to the outstanding optical nonlinearities of BTO, this result is a key step towards achieving quasi-phase-matching in BTO. We first grow the BTO thin film on a dysprosium scandate substrate using pulsed laser deposition, with a thin layer of strontium ruthenate later serving as the bottom electrode for poling. We present characterization of the BTO thin film using x-ray diffraction and piezo-response force microscopy to clearly demonstrate single crystal, single domain growth of the film, which enables the desired periodic poling. To investigate the poling quality, we apply both non-destructive piezo force response microscopy and destructive etching-assisted scanning electron microscopy, and we show that high quality, uniform and intransient poling with 50% duty cycle and periods ranging from 2 μm to 10 μm is achieved. The successful realization of periodic poling in BTO thin film unlocks the potential for highly efficient nonlinear processes under quasi-phase-matching that seemed far-fetched with prior polycrystalline BTO thin films, which predominantly relied on efficiency-limited random or non-phase matching conditions, and is a key step towards integration of BTO photonic devices.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00614 [pdf, other]

Learning Granularity-Aware Affordances from Human-Object Interaction for Tool-Based Functional Grasping in Dexterous Robotics

Authors: Fan Yang, Wenrui Chen, Kailun Yang, Haoran Lin, DongSheng Luo, Conghui Tang, Zhiyong Li, Yaonan Wang

Abstract: To enable robots to use tools, the initial step is teaching robots to employ dexterous gestures for touching specific areas precisely where tasks are performed. Affordance features of objects serve as a bridge in the functional interaction between agents and objects. However, leveraging these affordance cues to help robots achieve functional tool grasping remains unresolved. To address this, we propose a granularity-aware affordance feature extraction method for locating functional affordance areas and predicting dexterous coarse gestures. We study the intrinsic mechanisms of human tool use. On one hand, we use fine-grained affordance features of object-functional finger contact areas to locate functional affordance regions. On the other hand, we use highly activated coarse-grained affordance features in hand-object interaction regions to predict grasp gestures. Additionally, we introduce a model-based post-processing module that includes functional finger coordinate localization, finger-to-end coordinate transformation, and force feedback-based coarse-to-fine grasping. This forms a complete dexterous robotic functional grasping framework GAAF-Dex, which learns Granularity-Aware Affordances from human-object interaction for tool-based Functional grasping in Dexterous Robotics. Unlike fully-supervised methods that require extensive data annotation, we employ a weakly supervised approach to extract relevant cues from exocentric (Exo) images of hand-object interactions to supervise feature extraction in egocentric (Ego) images. We have constructed a small-scale dataset, FAH, which includes nearly 6K Exo and Ego images of functional hand-object interaction, covering 18 commonly used tools performing 6 tasks. Extensive experiments on the dataset demonstrate our method outperforms state-of-the-art methods. The code will be made publicly available at https://github.com/yangfan293/GAAF-DEX.

Submitted 30 June, 2024; originally announced July 2024.

Comments: The source code and the established dataset will be made publicly available at https://github.com/yangfan293/GAAF-DEX

arXiv:2407.00281 [pdf]

Distinguishing Surface and Bulk Electromagnetism via Their Dynamics in an Intrinsic Magnetic Topological Insulator

Authors: Khanh Duy Nguyen, Woojoo Lee, Jianchen Dang, Tongyao Wu, Gabriele Berruto, Chenhui Yan, Chi Ian Jess Ip, Haoran Lin, Qiang Gao, Seng Huat Lee, Binghai Yan, Chaoxing Liu, Zhiqiang Mao, Xiao-Xiao Zhang, Shuolong Yang

Abstract: The indirect exchange interaction between local magnetic moments via surface electrons has been long predicted to bolster the surface ferromagnetism in magnetic topological insulators (MTIs), which facilitates the quantum anomalous Hall effect. This unconventional effect is critical to determining the operating temperatures of future topotronic devices. However, the experimental confirmation of this mechanism remains elusive, especially in intrinsic MTIs. Here we combine time-resolved photoemission spectroscopy with time-resolved magneto-optical Kerr effect measurements to elucidate the unique electromagnetism at the surface of an intrinsic MTI MnBi2Te4. Theoretical modeling based on 2D Ruderman-Kittel-Kasuya-Yosida interactions captures the initial quenching of a surface-rooted exchange gap within a factor of two but over-estimates the bulk demagnetization by one order of magnitude. This mechanism directly explains the sizable gap in the quasi-2D electronic state and the nonzero residual magnetization in even-layer MnBi2Te4. Furthermore, it leads to efficient light-induced demagnetization comparable to state-of-the-art magnetophotonic crystals, promising an effective manipulation of magnetism and topological orders for future topotronics.

Submitted 28 June, 2024; originally announced July 2024.

Comments: 19 pages, 4 figures

arXiv:2407.00114 [pdf, other]

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang

Abstract: We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in open-world Minecraft. Compared to prior works that either emit textual goals to separate controllers or produce the control command directly, OmniJARVIS seeks a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $τ = \{o_0, a_0, \dots\}$ and an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chain-of-thoughts), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performances on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potentials.

Submitted 27 June, 2024; originally announced July 2024.

arXiv:2406.20098 [pdf, other]

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Authors: Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen

Abstract: Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.

Submitted 28 June, 2024; originally announced June 2024.

Comments: Website at https://mbzuai-llm.github.io/webpage2code/

arXiv:2406.19598 [pdf, other]

Mixture of In-Context Experts Enhance LLMs' Long Context Awareness

Authors: Hongzhan Lin, Ang Lv, Yuhan Chen, Chen Zhu, Yang Song, Hengshu Zhu, Rui Yan

Abstract: Many studies have revealed that large language models (LLMs) exhibit uneven awareness of different contextual positions. Their limited context awareness can lead to overlooking critical information and subsequent task failures. While several approaches have been proposed to enhance LLMs' context awareness, achieving both effectiveness and efficiency remains challenging. In this paper, for LLMs utilizing RoPE as position embeddings, we introduce a novel method called "Mixture of In-Context Experts" (MoICE) to address this challenge. MoICE comprises two key components: a router integrated into each attention head within LLMs and a lightweight router-only training optimization strategy: (1) MoICE views each RoPE angle as an "in-context" expert, demonstrated to be capable of directing the attention of a head to specific contextual positions. Consequently, each attention head flexibly processes tokens using multiple RoPE angles dynamically selected by the router to attend to the needed positions. This approach mitigates the risk of overlooking essential contextual information. (2) The router-only training strategy entails freezing LLM parameters and exclusively updating routers for only a few steps. When applied to open-source LLMs including Llama and Mistral, MoICE surpasses prior methods across multiple tasks on long context understanding and generation, all while maintaining commendable inference efficiency.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 14 pages, 5 figures

arXiv:2406.19392 [pdf, other]

ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

Authors: Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, Yu-Chiang Frank Wang

Abstract: We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e. human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, requiring advanced understanding of cause-and-effect relationships across video segments, poses significant challenges to even the frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine generated samples without manual effort, which empirical studies suggest can enhance the across-time reasoning via fine-tuning.

Submitted 2 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

Comments: Project page: https://rextime.github.io/

arXiv:2406.18591 [pdf, other]

Composition Vision-Language Understanding via Segment and Depth Anything Model

Authors: Mingxiao Huo, Pengliang Ji, Haotian Lin, Junchen Liu, Yixiao Wang, Yijun Chen

Abstract: We introduce a pioneering unified library that leverages depth anything, segment anything models to augment neural comprehension in language-vision model zero-shot understanding. This library synergizes the capabilities of the Depth Anything Model (DAM), Segment Anything Model (SAM), and GPT-4V, enhancing multimodal tasks such as vision-question-answering (VQA) and composition reasoning. Through the fusion of segmentation and depth analysis at the symbolic instance level, our library provides nuanced inputs for language models, significantly advancing image interpretation. Validated across a spectrum of in-the-wild real-world images, our findings showcase progress in vision-language models through neural-symbolic integration. This novel approach melds visual and language analysis in an unprecedented manner. Overall, our library opens new directions for future research aimed at decoding the complexities of the real world through advanced multimodal technologies and our code is available at https://github.com/AnthonyHuo/SAM-DAM-for-Compositional-Reasoning.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.16771 [pdf, other]

An antiferromagnetic diode effect in even-layered MnBi2Te4

Authors: Anyuan Gao, Shao-Wen Chen, Barun Ghosh, Jian-Xiang Qiu, Yu-Fei Liu, Yugo Onishi, Chaowei Hu, Tiema Qian, Damien Bérubé, Thao Dinh, Houchen Li, Christian Tzschaschel, Seunghyun Park, Tianye Huang, Shang-Wei Lien, Zhe Sun, Sheng-Chin Ho, Bahadur Singh, Kenji Watanabe, Takashi Taniguchi, David C. Bell, Arun Bansil, Hsin Lin, Tay-Rong Chang, Amir Yacoby, et al. (4 additional authors not shown)

Abstract: In a PN junction, the separation between positive and negative charges leads to diode transport. In the past few years, the intrinsic diode transport in noncentrosymmetric polar conductors has attracted great interest, because it suggests novel nonlinear applications and provides a symmetry-sensitive probe of Fermi surface. Recently, such studies have been extended to noncentrosymmetric superconductors, realizing the superconducting diode effect. Here, we show that, even in a centrosymmetric crystal without directional charge separation, the spins of an antiferromagnet (AFM) can generate a spatial directionality, leading to an AFM diode effect. We observe large second-harmonic transport in a nonlinear electronic device enabled by the compensated AFM state of even-layered MnBi2Te4. We also report a novel electrical sum-frequency generation (SFG), which has been rarely explored in contrast to the well-known optical SFG in wide-gap insulators. We demonstrate that the AFM enables an in-plane field-effect transistor and harvesting of wireless electromagnetic energy. The electrical SFG establishes a powerful method to study nonlinear electronics built by quantum materials. The AFM diode effect paves the way for potential device concepts including AFM logic circuits, self-powered AFM spintronics, and other applications that potentially bridge nonlinear electronics with AFM spintronics.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 33+8 pages, 14+2 figures

arXiv:2406.15881 [pdf, other]

Fast Tree-Field Integrators: From Low Displacement Rank to Topological Transformers

Authors: Krzysztof Choromanski, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Han Lin, Avinava Dubey, Tamas Sarlos, Snigdha Chaturvedi

Abstract: We present a new class of fast polylog-linear algorithms based on the theory of structured matrices (in particular low displacement rank) for integrating tensor fields defined on weighted trees. Several applications of the resulting fast tree-field integrators (FTFIs) are presented, including (a) approximation of graph metrics with tree metrics, (b) graph classification, (c) modeling on meshes, and finally (d) Topological Transformers (TTs) (Choromanski et al., 2022) for images. For Topological Transformers, we propose new relative position encoding (RPE) masking mechanisms with as few as three extra learnable parameters per Transformer layer, leading to 1.0-1.5%+ accuracy gains. Importantly, most of FTFIs are exact methods, thus numerically equivalent to their brute-force counterparts. When applied to graphs with thousands of nodes, those exact algorithms provide 5.7-13x speedups. We also provide an extensive theoretical analysis of our methods.

Submitted 22 June, 2024; originally announced June 2024.

Comments: Preprint. Comments welcome

arXiv:2406.15024 [pdf, other]

Thermal activated detection of dark particles in a weakly coupled quantum Ising ladder

Authors: Yunjing Gao, Jiahao Yang, Huihang Lin, Rong Yu, Jianda Wu

Abstract: The Ising$_h^2$ integrable field theory, which emerges when two quantum critical Ising chains are weakly coupled, possesses eight types of relativistic particles whose mass spectrum and scattering matrices are organized by the $\mathcal{D}_8^{(1)}$ algebra. It is predicted that all odd-parity particles are dark and cannot be directly excited from the ground state. This makes these dark particles hard to detect. Here, we study the local dynamical spin structure factor of the model at low frequencies and low temperatures. In contrast to the invisibility of the dark particles in THz spectroscopy or inelastic neutron scattering measurement, we find that the lightest dark particle is detectable, manifested as a thermal activation gap in nuclear magnetic resonance measurements. Our results provide a practical criterion for verifying the existence of dark particles.

Submitted 21 June, 2024; originally announced June 2024.

Comments: 6 pages, 4 figures

arXiv:2406.13231 [pdf, other]

Tight Lower Bounds for Directed Cut Sparsification and Distributed Min-Cut

Authors: Yu Cheng, Max Li, Honghao Lin, Zi-Yi Tai, David P. Woodruff, Jason Zhang

Abstract: In this paper, we consider two fundamental cut approximation problems on large graphs. We prove new lower bounds for both problems that are optimal up to logarithmic factors. The first problem is to approximate cuts in balanced directed graphs. In this problem, the goal is to build a data structure that $(1 \pm ε)$-approximates cut values in graphs with $n$ vertices. For arbitrary directed graphs, such a data structure requires $Ω(n^2)$ bits even for constant $ε$. To circumvent this, recent works study $β$-balanced graphs, meaning that for every directed cut, the total weight of edges in one direction is at most $β$ times that in the other direction. We consider two models: the for-each model, where the goal is to approximate each cut with constant probability, and the for-all model, where all cuts must be preserved simultaneously. We improve the previous $Ω(n \sqrt{β/ε})$ lower bound to $\tilde{Ω}(n \sqrt{β}/ε)$ in the for-each model, and we improve the previous $Ω(n β/ε)$ lower bound to $Ω(n β/ε^2)$ in the for-all model. This resolves the main open questions of (Cen et al., ICALP, 2021). The second problem is to approximate the global minimum cut in a local query model, where we can only access the graph via degree, edge, and adjacency queries. We improve the previous $Ω\bigl(\frac{m}{k}\bigr)$ query complexity lower bound to $Ω\bigl(\min\{m, \frac{m}{ε^2 k}\}\bigr)$ for this problem, where $m$ is the number of edges, $k$ is the size of the minimum cut, and we seek a $(1+ε)$-approximation. In addition, we show that existing upper bounds with slight modifications match our lower bound up to logarithmic factors.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.12718 [pdf, other]

AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

Authors: Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Guang Dai, Ping Chen, Shijian Lu

Abstract: Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) are facing a prevalent problem with object hallucinations, where the generated textual responses are inconsistent with ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features, while failing to capture prompt-relevant local features, consequently undermining the visual grounding capacity of LVLMs and leading to hallucinations. To this end, we propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates object hallucinations by exploring an ensemble of global features for response generation and local features for visual discrimination simultaneously. Our approach exhibits an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is reserved while irrelevant distractions are masked. With the augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances general perception capability for LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.

Submitted 21 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12653 [pdf, ps, other]

Single-photon and two-photon blockade in a three-wave mixing system with a two-level atom

Authors: HongYu Lin

Abstract: This paper discusses conventional photon blockade (CPB) and two-photon blockade (2PB) in a three-wave mixing system embedded with a two-level atom in the high-frequency cavity. Analytical conditions for achieving CPB and 2PB are obtained by analyzing the eigenvalues of the system Hamiltonian. Numerical solutions, derived by solving the master equation in a truncated Fock space, are consistent with the analytical conditions. Detailed analysis of system parameters reveals the influence of the embedded atom on achieving different types of photon blockade. Unlike previous schemes, this system can achieve single-photon blockade simultaneously in three photon modes. Additionally, by adjusting the coupling coefficient between the atom and high-frequency mode photons, the system can switch between single-photon blockade and two-photon blockade in the high-frequency mode.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12386 [pdf, other]

IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models

Authors: Qiyao Wang, Jianguo Huang, Shule Lu, Yuan Lin, Kan Xu, Liang Yang, Hongfei Lin

Abstract: The rapid development of Large Language Models (LLMs) in vertical domains, including intellectual property (IP), lacks a specific evaluation benchmark for assessing their understanding, application, and reasoning abilities. To fill this gap, we introduce IPEval, the first evaluation benchmark tailored for IP agency and consulting tasks. IPEval comprises 2657 multiple-choice questions across four major dimensions: creation, application, protection, and management of IP. These questions span patent rights (inventions, utility models, designs), trademarks, copyrights, trade secrets, and other related laws. Evaluation methods include zero-shot, 5-few-shot, and Chain of Thought (CoT) for seven LLM types, predominantly in English or Chinese. Results show superior English performance by models like GPT series and Qwen series, while Chinese-centric LLMs excel in Chinese tests, albeit specialized IP LLMs lag behind general-purpose ones. Regional and temporal aspects of IP underscore the need for LLMs to grasp legal nuances and evolving laws. IPEval aims to accurately gauge LLM capabilities in IP and spur development of specialized models. Website: https://ipeval.github.io/

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12221 [pdf, other]

On-Policy Fine-grained Knowledge Feedback for Hallucination Mitigation

Authors: Xueru Wen, Xinyu Lu, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun

Abstract: Hallucination occurs when large language models (LLMs) exhibit behavior that deviates from the boundaries of their knowledge during the response generation process. Previous learning-based methods focus on detecting knowledge boundaries and finetuning models with instance-level feedback, but they suffer from inaccurate signals due to off-policy data sampling and coarse-grained feedback. In this paper, we introduce Reinforcement Learning for Hallucination (RLFH), a fine-grained feedback-based online reinforcement learning method for hallucination mitigation. Unlike previous learning-based methods, RLFH enables LLMs to explore the boundaries of their internal knowledge and provide on-policy, fine-grained feedback on these explorations. To construct fine-grained feedback for learning reliable generation behavior, RLFH decomposes the outcomes of large models into atomic facts, provides statement-level evaluation signals, and traces back the signals to the tokens of the original responses. Finally, RLFH adopts the online reinforcement algorithm with these token-level rewards to adjust model behavior for hallucination mitigation. For effective on-policy optimization, RLFH also introduces an LLM-based fact assessment framework to verify the truthfulness and helpfulness of atomic facts without human intervention. Experiments on HotpotQA, SQuADv2, and Biography benchmarks demonstrate that RLFH can balance their usage of internal knowledge during the generation process to eliminate the hallucination behavior of LLMs.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.12207 [pdf, other]

doi 10.1103/PhysRevB.109.235133

The Green's function Monte Carlo combined with projected entangled pair state approach to the frustrated $J_1$-$J_2$ Heisenberg model

Authors: He-Yu Lin, Yibin Guo, Rong-Qiang He, Z. Y. Xie, Zhong-Yi Lu

Abstract: The tensor network algorithm, a family of prevalent numerical methods for quantum many-body problems, aptly captures the entanglement properties intrinsic to quantum systems, enabling precise representation of quantum states. However, its computational cost is notably high, particularly in calculating physical observables like correlation functions. To surmount the computational challenge and enhance efficiency, we propose integrating the Green's function Monte Carlo (GFMC) method with the projected entangled pair state (PEPS) ansatz. This approach combines the high-efficiency characteristics of Monte Carlo with the sign-free nature of tensor network states and proves effective in addressing the computational bottleneck. To showcase its prowess, we apply this hybrid approach to investigate the antiferromagnetic $J_1$-$J_2$ Heisenberg model on the square lattice, a model notorious for its sign problem in quantum Monte Carlo simulations. Our results reveal a substantial improvement in the accuracy of ground-state energy when utilizing a preliminary PEPS as the guiding wave function for GFMC. By calculating the structure factor and spin-spin correlation functions, we further characterize the phase diagram, identifying a possible columnar valence-bond state phase within the intermediate parameter range of $0.52 < J_2/J_1 < 0.58$. This comprehensive study underscores the efficacy of our combined approach, demonstrating its ability to accurately simulate frustrated quantum spin systems while ensuring computational efficiency.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 11 pages, 15 figures

Journal ref: Phys. Rev. B 109, 235133 (2024)

arXiv:2406.12017 [pdf, other]

Sparsity-Constraint Optimization via Splicing Iteration

Authors: Zezhi Wang, Jin Zhu, Junxian Zhu, Borui Tang, Hongmei Lin, Xueqin Wang

Abstract: Sparsity-constraint optimization has wide applicability in signal processing, statistics, and machine learning. Existing fast algorithms must burdensomely tune parameters, such as the step size or the implementation of precise stop criteria, which may be challenging to determine in practice. To address this issue, we develop an algorithm named Sparsity-Constraint Optimization via sPlicing itEration (SCOPE) to optimize nonlinear differential objective functions with strong convexity and smoothness in low dimensional subspaces. Algorithmically, the SCOPE algorithm converges effectively without tuning parameters. Theoretically, SCOPE has a linear convergence rate and converges to a solution that recovers the true support set when it correctly specifies the sparsity. We also develop parallel theoretical results without restricted-isometry-property-type conditions. We apply SCOPE's versatility and power to solve sparse quadratic optimization, learn sparse classifiers, and recover sparse Markov networks for binary variables. The numerical results on these specific tasks reveal that SCOPE perfectly identifies the true support set with a 10-1000x speedup over the standard exact solver, confirming SCOPE's algorithmic and theoretical merits. Our open-source Python package skscope based on C++ implementation is publicly available on GitHub, reaching a ten-fold speedup on the competing convex relaxation methods implemented by the cvxpy library.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 34 pages

arXiv:2406.11514 [pdf, other]

Counterfactual Debating with Preset Stances for Hallucination Elimination of LLMs

Authors: Yi Fang, Moxin Li, Wenjie Wang, Hui Lin, Fuli Feng

Abstract: Large Language Models (LLMs) excel in various natural language processing tasks but struggle with hallucination issues. Existing solutions have considered utilizing LLMs' inherent reasoning abilities to alleviate hallucination, such as self-correction and diverse sampling methods. However, these methods often overtrust LLMs' initial answers due to inherent biases. The key to alleviating this issue lies in overriding LLMs' inherent biases for answer inspection. To this end, we propose a CounterFactual Multi-Agent Debate (CFMAD) framework. CFMAD presets the stances of LLMs to override their inherent biases by compelling LLMs to generate justifications for a predetermined answer's correctness. The LLMs with different predetermined stances are engaged with a skeptical critic for counterfactual debate on the rationality of generated justifications. Finally, the debate process is evaluated by a third-party judge to determine the final answer. Extensive experiments on four datasets of three tasks demonstrate the superiority of CFMAD over existing methods.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11288 [pdf, other]

MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

Authors: Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma

Abstract: Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks, such as visual question answering and image captioning. These models embed multimodal facts within their parameters, rather than relying on external knowledge bases to store factual information explicitly. However, the content discerned by LVLMs may deviate from actual facts due to inherent bias or incorrect inference. To address this issue, we introduce MFC-Bench, a rigorous and comprehensive benchmark designed to evaluate the factual accuracy of LVLMs across three tasks: Manipulation, Out-of-Context, and Veracity Classification. Through our evaluation on MFC-Bench, we benchmarked 12 diverse and representative LVLMs, uncovering that current models still fall short in multimodal fact-checking and demonstrate insensitivity to various forms of manipulated content. We hope that MFC-Bench could raise attention to the trustworthy artificial intelligence potentially assisted by LVLMs in the future. The MFC-Bench and accompanying resources are publicly accessible at https://github.com/wskbest/MFC-Bench, contributing to ongoing research in the multimodal fact-checking field.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 22 pages, 8 figures

arXiv:2406.10840 [pdf, other]

CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph

Authors: Haitao Lin, Guojiang Zhao, Odin Zhang, Yufei Huang, Lirong Wu, Zicheng Liu, Siyuan Li, Cheng Tan, Zhifeng Gao, Stan Z. Li

Abstract: Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on de novo molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative designation of de novo molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on the structures of protein pockets. Our evaluations are conducted with fairness, encompassing comprehensive perspectives on interaction, chemical properties, geometry authenticity, and substructure validity. We further provide the pre-trained versions of the state-of-the-art models and deep insights with analysis from empirical studies. The codebase for CBGBench is publicly accessible at https://github.com/Edapinenut/CBGBench.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 9 pages main context

arXiv:2406.10280 [pdf, other]

Transferable Embedding Inversion Attack: Uncovering Privacy Risks in Text Embeddings without Model Queries

Authors: Yu-Hsiang Huang, Yuche Tsai, Hsiang Hsiao, Hong-Yi Lin, Shou-De Lin

Abstract: This study investigates the privacy risks associated with text embeddings, focusing on the scenario where attackers cannot access the original embedding model. Contrary to previous research requiring direct model access, we explore a more realistic threat model by developing a transfer attack method. This approach uses a surrogate model to mimic the victim model's behavior, allowing the attacker to infer sensitive information from text embeddings without direct access. Our experiments across various embedding models and a clinical dataset demonstrate that our transfer attack significantly outperforms traditional methods, revealing the potential privacy vulnerabilities in embedding technologies and emphasizing the need for enhanced security measures.

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted at ACL 2024 Main Conference

arXiv:2406.09817 [pdf, other]

Efficient and Precise Force Field Optimization for Biomolecules Using DPA-2

Authors: Junhan Chang, Duo Zhang, Yuqing Deng, Hongrui Lin, Zhirong Liu, Linfeng Zhang, Hang Zheng, Xinyan Wang

Abstract: Molecular simulations are essential tools in computational chemistry, enabling the prediction and understanding of molecular interactions and thermodynamic properties of biomolecules. However, traditional force fields face significant challenges in accurately representing novel molecules and complex chemical environments due to the labor-intensive process of manually setting optimization parameters and the high computational cost of quantum mechanical calculations. To overcome these difficulties, we fine-tuned a high-accuracy DPA-2 pre-trained model and applied it to optimize force field parameters on-the-fly, significantly reducing computational costs. Our method combines this fine-tuned DPA-2 model with a node-embedding-based similarity metric, allowing seamless augmentation to new chemical species without manual intervention. We applied this process to the TYK2 inhibitor and PTP1B systems and demonstrated its effectiveness through the improvement of free energy perturbation calculation results. This advancement contributes valuable insights and tools for the computational chemistry community.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09800 [pdf, ps, other]

R-matrix presentation of quantum affine superalgebra for type $\mathfrak{osp}(2m+1|2n)$

Authors: Xianghua Wu, Hongda Lin, Honglian Zhang

Abstract: In the current paper, we extend our prior research [X. Wu, H. Lin and H. Zhang, Braid group action and quantum affine superalgebra for type $\mathfrak{osp}(2m+1|2n)$. Preprint, (2024)], which introduced the Drinfeld presentation of the quantum affine superalgebra associated with the orthosymplectic Lie superalgebra $\mathfrak{osp}(2m+1|2n)$ for $m>0$. Based on this work, our present investigation focuses on its R-matrix presentation. Moreover, we establish an explicit isomorphism between the R-matrix presentation and the Drinfeld presentation. In particular, our contribution extends the investigations of Jing, Liu, and Molev concerning quantum affine algebra in type BCD to the domain of supersymmetry [arXiv:1903.00204; arXiv:1911.03496].

Submitted 17 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09342 [pdf, other]

Wavefront shaping simulations with augmented partial factorization

Authors: Ho-Chun Lin, Zeyu Wang, Chia Wei Hsu

Abstract: Wavefront shaping can tailor multipath interference to control multiple scattering of waves in complex optical systems. However, full-wave simulations that capture multiple scattering are computationally demanding given the large system size and the large number of input channels. Recently, an "augmented partial factorization" (APF) method was proposed to significantly speed up such full-wave simulations. In this tutorial, we illustrate how to perform wavefront shaping simulations with the APF method using the open-source frequency-domain electromagnetic scattering solver MESTI. We present the foundational concepts and then walk through four examples: computing the scattering matrix of a slab with random permittivities, open high-transmission channels through disorder, focusing inside disorder with phase conjugation, and reflection matrix computation in a spatial focused-beam basis. The goal is to lower the barrier for researchers to use simulations to explore the rich phenomena enabled by wavefront shaping.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.09270 [pdf, other]

Discovery and Extensive Follow-Up of SN 2024ggi, a nearby type IIP supernova in NGC 3621

Authors: Ting-Wan Chen, Sheng Yang, Shubham Srivastav, Takashi J. Moriya, Stephen J. Smartt, Sofia Rest, Armin Rest, Hsing Wen Lin, Hao-Yu Miao, Yu-Chi Cheng, Amar Aryan, Chia-Yu Cheng, Morgan Fraser, Li-Ching Huang, Meng-Han Lee, Cheng-Han Lai, Yu Hsuan Liu, Aiswarya Sankar. K, Ken W. Smith, Heloise F. Stevance, Ze-Ning Wang, Joseph P. Anderson, Charlotte R. Angus, Thomas de Boer, Kenneth Chambers, et al. (23 additional authors not shown)

Abstract: We present the discovery and early observations of the nearby Type II supernova (SN) 2024ggi in NGC 3621 at 6.64 +/- 0.3 Mpc. The SN was caught 5.8 (+1.9 -2.9) hours after its explosion by the ATLAS survey. Early-phase, high-cadence, and multi-band photometric follow-up was performed by the Kinder (Kilonova Finder) project, collecting over 1000 photometric data points within a week. The combined o- and r-band light curves show a rapid rise of 3.3 magnitudes in 13.7 hours, much faster than SN 2023ixf (another recent, nearby, and well-observed SN II). Between 13.8 and 18.8 hours after explosion SN 2024ggi became bluer, with u-g colour dropping from 0.53 to 0.15 mag. The rapid blueward evolution indicates a wind shock breakout (SBO) scenario. No hour-long brightening expected for the SBO from a bare stellar surface was detected during our observations. The classification spectrum, taken 17 hours after the SN explosion, shows flash features of high-ionization species such as Balmer lines, He I, C III, and N III. Detailed light curve modeling reveals critical insights into the properties of the circumstellar material (CSM). Our favoured model has an explosion energy of 2 x 10^51 erg, a mass-loss rate of 10^-3 solar_mass/yr (with an assumed 10 km/s wind), and a confined CSM radius of 6 x 10^14 cm. The corresponding CSM mass is 0.4 solar_mass. Comparisons with SN 2023ixf highlight that SN 2024ggi has a smaller CSM density, resulting in a faster rise and fainter UV flux. The extensive dataset and the involvement of citizen astronomers underscore that a collaborative network is essential for SBO searches, leading to more precise and comprehensive SN characterizations.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 11 pages, 5 figures in manuscript, 6 pages in appendix, submitted to ApJL

arXiv:2406.08102 [pdf, other]

Adversarial Patch for 3D Local Feature Extractor

Authors: Yu Wen Pao, Li Chang Lai, Hong-Yi Lin

Abstract: Local feature extractors are the cornerstone of many computer vision tasks. However, their vulnerability to adversarial attacks can significantly compromise their effectiveness. This paper discusses approaches to attack sophisticated local feature extraction algorithms and models to achieve two distinct goals: (1) forcing a match between originally non-matching image regions, and (2) preventing a match between originally matching regions. At the end of the paper, we discuss the performance and drawbacks of different patch generation methods.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07806 [pdf, other]

Probing the Shock Breakout Signal of SN 2024ggi from the Transformation of Early Flash Spectroscopy

Authors: Jujia Zhang, Luc Dessart, Xiaofeng Wang, Qian Zhai, Yi Yang, Liping Li, Han Lin, Giorgio Valerin, Yongzhi Cai, Zhen Guo, Lingzhi Wang, Zeyi Zhao, Zhenyu Wang, Shengyu Yan

Abstract: We present early-time, hour-to-day cadence spectroscopy of the nearby type II supernova (SN II) 2024ggi, which was discovered at a phase when the SN shock just emerged from the red-supergiant (RSG) progenitor star. Over the first few days after the first light, SN 2024ggi exhibited prominent narrow emission lines formed through intense and persistent photoionization of the nearby circumstellar material (CSM). In the first 63 hours, spectral lines of He, C, N, and O revealed a rapid rise in ionization, as a result of the progressive sweeping-up of the CSM by the shock. The duration of the IIn-like spectra indicates a dense and relatively confined CSM distribution extending up to $\sim 4 \times 10^{14}$ cm. Spectral modeling reveals a CSM mass loss rate at this region exceeding $5 \times 10^{-3}\,{\rm M}_{\odot}$ yr$^{-1}$ is required to reproduce low-ionization emissions, which dramatically exceeds that of an RSG. Analyzing H$α$ emission shift implies the velocity of the unshocked outer CSM to be between 20 and 40 km s$^{-1}$, matching the typical wind velocity of an RSG. The differences between the inner and outer layers of the CSM and an RSG progenitor highlight a complex mass loss history before the explosion of SN 2024ggi.

Submitted 29 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: 10 pages and 5 figures in the main text (16 pages and 9 figures in total). Accepted for publication in ApJL

arXiv:2406.07540 [pdf, other]

Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Authors: Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou

Abstract: Recent controllable generation approaches such as FreeControl and Diffusion Self-guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable the structure alignment with a structure image and semantic-aware appearance transfer to facilitate the appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model. See our project page for an overview of the results: https://genforce.github.io/ctrl-x

Submitted 11 June, 2024; originally announced June 2024.

Comments: 18 pages, 11 figures, see project page at https://genforce.github.io/ctrl-x

arXiv:2406.07307 [pdf, ps, other]

The effective cone conjecture for Calabi--Yau pairs

Authors: Cécile Gachet, Hsueh-Yung Lin, Isabel Stenger, Long Wang

Abstract: We formulate an effective cone conjecture for klt Calabi--Yau pairs $(X,Δ)$, pertaining to the structure of the cone of effective divisors $\mathrm{Eff}(X)$ modulo the action of the subgroup of pseudo-automorphisms $\mathrm{PsAut}(X,Δ)$. Assuming the existence of good minimal models in dimension $\dim(X)$, known to hold in dimension up to $3$, we prove that the effective cone conjecture for $(X,Δ)$ is equivalent to the Kawamata--Morrison--Totaro movable cone conjecture for $(X,Δ)$. As an application, we show that the movable cone conjecture unconditionally holds for the smooth Calabi--Yau threefolds introduced by Schoen and studied by Namikawa, Grassi and Morrison. We also show that for such a Calabi--Yau threefold $X$, all of its minimal models, apart from $X$ itself, have rational polyhedral nef cones.

Submitted 11 June, 2024; originally announced June 2024.

Comments: 31 pages

arXiv:2406.06858 [pdf, other]

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Authors: Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

Abstract: Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. This limits the scalability of the technique within a group of devices with high speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.

Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06727 [pdf, other]

Full transmission of vectorial waves through 3D multiple-scattering media

Authors: Ho-Chun Lin, Chia Wei Hsu

Abstract: A striking prediction from the random matrix theory in mesoscopic physics is the existence of "open channels": waves that can use multipath interference to achieve perfect transmission across an opaque disordered medium even in the multiple-scattering regime. Realization of such open channels requires a coherent control of the complete incident wavefront. To date, the open channels have only been demonstrated in scalar two-dimensional (2D) structures, both experimentally and with numerical studies. Here, we utilize a recently proposed "augmented partial factorization" full-wave simulation method to compute the scattering matrix from 3D vectorial Maxwell's equations and demonstrate the existence of open channels in 3D disordered media. We examine the spatial profile of such open channels, demonstrate the existence of a bimodal transmission eigenvalue distribution with full control, and study the effects of incomplete polarization control and of a finite illumination area. This study confirms the validity of the random matrix theory in vectorial systems. The simulation framework provides full access to the complex multi-channel wave transport in 3D disordered systems, filling the gap left by experimental capabilities.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06640 [pdf]

A high-performance reconstruction method for partially coherent ptychography

Authors: Wenhui Xu, Shoucong Ning, Pengju Sheng, Huixiang Lin, Angus I Kirkland, Yong Peng, Fucai Zhang

Abstract: Ptychography is now integrated as a tool in mainstream microscopy allowing quantitative and high-resolution imaging capabilities over a wide field of view. However, its ultimate performance is inevitably limited by the available coherent flux when implemented using electrons or laboratory X-ray sources. We present a universal reconstruction algorithm with high tolerance to low coherence for both far-field and near-field ptychography. The approach is practical for partial temporal and spatial coherence and requires no prior knowledge of the source properties. Our initial visible-light and electron data show that the method can dramatically improve the reconstruction quality and accelerate the convergence rate of the reconstruction. The approach also integrates well into existing ptychographic engines. It can also improve mixed-state and numerical monochromatisation methods, requiring a smaller number of coherent modes or lower dimensionality of Krylov subspace while providing more stable and faster convergence. We propose that this approach could have significant impact on ptychography of weakly scattering samples.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.05862 [pdf, other]

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

Authors: Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, Hongquan Lin, Jiaming Li, Yuansheng Ni, Haihong Wu, Yaswanth Narsupalli, Zhigang Zheng, Chengming Li, Xiping Hu, Ruifeng Xu, Xiaojun Chen, Min Yang, Jiaheng Liu, Ruibo Liu, Wenhao Huang, Ge Zhang, et al. (1 additional author not shown)

Abstract: The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, it is observed that most models exhibit enhanced accuracy when image sentiment polarity hints are incorporated into the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.

Submitted 11 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

Comments: 100 pages, 82 figures, add citations

arXiv:2406.05135 [pdf]

Smart Navigation System for Parking Assignment at Large Events: Incorporating Heterogeneous Driver Characteristics

Authors: Xi Cheng, Gaofeng Su, Siyuan Feng, Ke Liu, Chen Zhu, Hui Lin, Jilin Song, Jianan Chen

Abstract: Parking challenges escalate significantly during large events such as concerts or sports games, yet few studies address dynamic parking lot assignments for such occasions. This paper introduces a smart navigation system designed to optimize parking assignments swiftly during large events, utilizing a mixed search algorithm that accounts for the heterogeneous characteristics of drivers. We conducted simulations in the Berkeley city area during the "Big Game" to validate our system and demonstrate the benefits of our innovative parking assignment approach.

Submitted 14 May, 2024; originally announced June 2024.

arXiv:2406.05046 [pdf, other]

The Dark Energy Survey Supernova Program: Light curves and 5-Year data release

Authors: B. O. Sánchez, D. Brout, M. Vincenzi, M. Sako, K. Herner, R. Kessler, T. M. Davis, D. Scolnic, M. Acevedo, J. Lee, A. Möller, H. Qu, L. Kelsey, P. Wiseman, P. Armstrong, B. Rose, R. Camilleri, R. Chen, L. Galbany, E. Kovacs, C. Lidman, B. Popovic, M. Smith, M. Sullivan, M. Toy, et al. (60 additional authors not shown)

Abstract: We present $griz$ photometric light curves for the full 5 years of the Dark Energy Survey Supernova program (DES-SN), obtained with both forced Point Spread Function (PSF) photometry on Difference Images (DIFFIMG) performed during survey operations, and Scene Modelling Photometry (SMP) on search images processed after the survey. This release contains $31,636$ DIFFIMG and $19,706$ high-quality SMP light curves, the latter of which contains $1635$ photometrically-classified supernovae that pass cosmology quality cuts. This sample spans the largest redshift ($z$) range ever covered by a single SN survey ($0.1<z<1.13$) and is the largest single sample from a single instrument of SNe ever used for cosmological constraints. We describe in detail the improvements made to obtain the final DES-SN photometry and provide a comparison to what was used in the DES-SN3YR spectroscopically-confirmed SN Ia sample. We also include a comparative analysis of the performance of the SMP photometry with respect to the real-time DIFFIMG forced photometry and find that SMP photometry is more precise, more accurate, and less sensitive to the host-galaxy surface brightness anomaly. The public release of the light curves and ancillary data can be found at https://github.com/des-science/DES-SN5YR. Finally, we discuss implications for future transient surveys, such as the forthcoming Vera Rubin Observatory Legacy Survey of Space and Time (LSST).

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.04997 [pdf, ps, other]

On the social bias of speech self-supervised models

Authors: Yi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-yi Lee

Abstract: Self-supervised learning (SSL) speech models have achieved remarkable performance in various tasks, yet the biased outcomes, especially affecting marginalized groups, raise significant concerns. Social bias refers to the phenomenon where algorithms potentially amplify disparate properties between social groups present in the data used for training. Bias in SSL models can perpetuate injustice by automating discriminatory patterns and reinforcing inequitable systems. This work reveals that prevalent SSL models inadvertently acquire biased associations. We probe how various factors, such as model architecture, size, and training methodologies, influence the propagation of social bias within these models. Finally, we explore the efficacy of debiasing SSL models through regularization techniques, specifically via model compression. Our findings reveal that employing techniques such as row-pruning and training wider, shallower models can effectively mitigate social bias within SSL models.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

Showing 1–50 of 2,887 results for author: Lin, H