subscribe to arXiv mailings

Scaling Diffusion Transformers to 16 Billion Parameters

Authors: Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang

Abstract: In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer, that is scalable and competitive with dense networks while exhibiting highly optimized inference. The DiT-MoE includes two simple designs: shared expert routing and expert-level balance loss, thereby capturing common knowledge and reducing redundancy among the different routed experts. When applied to conditional ima… ▽ More In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer, that is scalable and competitive with dense networks while exhibiting highly optimized inference. The DiT-MoE includes two simple designs: shared expert routing and expert-level balance loss, thereby capturing common knowledge and reducing redundancy among the different routed experts. When applied to conditional image generation, a deep analysis of experts specialization gains some interesting observations: (i) Expert selection shows preference with spatial position and denoising time step, while insensitive with different class-conditional information; (ii) As the MoE layers go deeper, the selection of experts gradually shifts from specific spacial position to dispersion and balance. (iii) Expert specialization tends to be more concentrated at the early time step and then gradually uniform after half. We attribute it to the diffusion process that first models the low-frequency spatial information and then high-frequency complex information. Based on the above guidance, a series of DiT-MoE experimentally achieves performance on par with dense networks yet requires much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling diffusion model at a 16.5B parameter that attains a new SoTA FID-50K score of 1.80 in 512$\times$512 resolution settings. The project page: https://github.com/feizc/DiT-MoE. △ Less

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.11549 [pdf, other]

How Personality Traits Influence Negotiation Outcomes? A Simulation based on Large Language Models

Authors: Yin Jou Huang, Rafik Hadfi

Abstract: Psychological evidence reveals the influence of personality traits on decision-making. For instance, agreeableness is generally associated with positive outcomes in negotiations, whereas neuroticism is often linked to less favorable outcomes. This paper introduces a simulation framework centered on Large Language Model (LLM) agents endowed with synthesized personality traits. The agents negotiate… ▽ More Psychological evidence reveals the influence of personality traits on decision-making. For instance, agreeableness is generally associated with positive outcomes in negotiations, whereas neuroticism is often linked to less favorable outcomes. This paper introduces a simulation framework centered on Large Language Model (LLM) agents endowed with synthesized personality traits. The agents negotiate within bargaining domains and possess customizable personalities and objectives. The experimental results show that the behavioral tendencies of LLM-based simulations could reproduce behavioral patterns observed in human negotiations. The contribution is twofold. First, we propose a simulation methodology that investigates the alignment between the linguistic and economic capabilities of LLM agents. Secondly, we offer empirical insights into the strategic impact of Big-Five personality traits on the outcomes of bilateral negotiations. We also provide a case study based on synthesized bargaining dialogues to reveal intriguing behaviors, including deceitful and compromising behaviors. △ Less

Submitted 16 July, 2024; originally announced July 2024.

Comments: 13 pages, 4 figures

arXiv:2407.11486 [pdf, other]

An efficient framework based on large foundation model for cervical cytopathology whole slide image screening

Authors: Jialong Huang, Gaojie Li, Shichao Kan, Jianfeng Liu, Yixiong Liang

Abstract: Current cervical cytopathology whole slide image (WSI) screening primarily relies on detection-based approaches, which are limited in performance due to the expense and time-consuming annotation process. Multiple Instance Learning (MIL), a weakly supervised approach that relies solely on bag-level labels, can effectively alleviate these challenges. Nonetheless, MIL commonly employs frozen pretrain… ▽ More Current cervical cytopathology whole slide image (WSI) screening primarily relies on detection-based approaches, which are limited in performance due to the expense and time-consuming annotation process. Multiple Instance Learning (MIL), a weakly supervised approach that relies solely on bag-level labels, can effectively alleviate these challenges. Nonetheless, MIL commonly employs frozen pretrained models or self-supervised learning for feature extraction, which suffers from low efficacy or inefficiency. In this paper, we propose an efficient framework for cervical cytopathology WSI classification using only WSI-level labels through unsupervised and weakly supervised learning. Given the sparse and dispersed nature of abnormal cells within cytopathological WSIs, we propose a strategy that leverages the pretrained foundation model to filter the top$k$ high-risk patches. Subsequently, we suggest parameter-efficient fine-tuning (PEFT) of a large foundation model using contrastive learning on the filtered patches to enhance its representation ability for task-specific signals. By training only the added linear adapters, we enhance the learning of patch-level features with substantially reduced time and memory consumption. Experiments conducted on the CSD and FNAC 2019 datasets demonstrate that the proposed method enhances the performance of various MIL methods and achieves state-of-the-art (SOTA) performance. The code and trained models are publicly available at https://github.com/CVIU-CSU/TCT-InfoNCE. △ Less

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.10892 [pdf, other]

First Measurement of Solar $^8$B Neutrino Flux through Coherent Elastic Neutrino-Nucleus Scattering in PandaX-4T

Authors: PandaX Collaboration, Zihao Bo, Wei Chen, Xun Chen, Yunhua Chen, Zhaokan Cheng, Xiangyi Cui, Yingjie Fan, Deqing Fang, Zhixing Gao, Lisheng Geng, Karl Giboni, Xunan Guo, Xuyuan Guo, Zichao Guo, Chencheng Han, Ke Han, Changda He, Jinrong He, Di Huang, Houqi Huang, Junting Huang, Ruquan Hou, Yu Hou, Xiangdong Ji , et al. (77 additional authors not shown)

Abstract: The PandaX-4T liquid xenon detector at the China Jinping Underground Laboratory is used to measure the solar $^8$B neutrino flux by detecting neutrinos through coherent scattering with xenon nuclei. Data samples requiring the coincidence of scintillation and ionization signals (paired), as well as unpaired ionization-only signals (US2), are selected with energy threshold of approximately 1.1 keV (… ▽ More The PandaX-4T liquid xenon detector at the China Jinping Underground Laboratory is used to measure the solar $^8$B neutrino flux by detecting neutrinos through coherent scattering with xenon nuclei. Data samples requiring the coincidence of scintillation and ionization signals (paired), as well as unpaired ionization-only signals (US2), are selected with energy threshold of approximately 1.1 keV (0.33 keV) nuclear recoil energy. Combining the commissioning run and the first science run of PandaX-4T, a total exposure of 1.25 and 1.04 tonne$\cdot$year are collected for the paired and US2, respectively. After unblinding, 3 and 332 events are observed with an expectation of 2.8$\pm$0.5 and 251$\pm$32 background events, for the paired and US2 data, respectively. A combined analysis yields a best-fit $^8$B neutrino signal of 3.5 (75) events from the paired (US2) data sample, with $\sim$37\% uncertainty, and the background-only hypothesis is disfavored at 2.64$σ$ significance. This gives a solar $^8$B neutrino flux of ($8.4\pm3.1$)$\times$10$^6$ cm$^{-2}$s$^{-1}$, consistent with the standard solar model prediction. This is the first indication of solar $^8$B neutrino ``fog'' in a dark matter direct detection experiment. △ Less

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.10523 [pdf, other]

Variational Quantum Imaginary Time Evolution for Matrix Product State Ansatz with Tests on Transcorrelated Hamiltonians

Authors: Hao-En Li, Xiang Li, Jia-Cheng Huang, Guang-Ze Zhang, Zhu-Ping Shen, Chen Zhao, Jun Li, Han-Shi Hu

Abstract: The matrix product state (MPS) ansatz offers a promising approach for finding the ground state of molecular Hamiltonians and solving quantum chemistry problems. Building on this concept, the proposed technique of quantum circuit MPS (QCMPS) enables the simulation of chemical systems using a relatively small number of qubits. In this study, we enhance the optimization performance of the QCMPS ansat… ▽ More The matrix product state (MPS) ansatz offers a promising approach for finding the ground state of molecular Hamiltonians and solving quantum chemistry problems. Building on this concept, the proposed technique of quantum circuit MPS (QCMPS) enables the simulation of chemical systems using a relatively small number of qubits. In this study, we enhance the optimization performance of the QCMPS ansatz by employing the variational quantum imaginary time evolution (VarQITE) approach. Guided by McLachlan's variational principle, the VarQITE method provides analytical metrics and gradients, resulting in improved convergence efficiency and robustness of the QCMPS. We validate these improvements numerically through simulations of $\rm H_2$, $\rm H_4$, and $\rm LiH$ molecules. Additionally, given that VarQITE is applicable to non-Hermitian Hamiltonians, we evaluate its effectiveness in preparing the ground state of transcorrelated (TC) Hamiltonians. This approach yields energy estimates comparable to the complete basis set (CBS) limit while using even fewer qubits. Specifically, we perform simulations of the beryllium atom and $\rm LiH$ molecule using only three qubits, maintaining high fidelity with the CBS ground state energy of these systems. This qubit reduction is achieved through the combined advantages of both the QCMPS ansatz and transcorrelation. Our findings demonstrate the potential practicality of this quantum chemistry algorithm on near-term quantum devices. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: 15 pages, 8 figures

arXiv:2407.10482 [pdf, other]

NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis

Authors: Yubin Hu, Xiaoyang Guo, Yang Xiao, Jingwei Huang, Yong-Jin Liu

Abstract: This paper presents NGP-RT, a novel approach for enhancing the rendering speed of Instant-NGP to achieve real-time novel view synthesis. As a classic NeRF-based method, Instant-NGP stores implicit features in multi-level grids or hash tables and applies a shallow MLP to convert the implicit features into explicit colors and densities. Although it achieves fast training speed, there is still a lot… ▽ More This paper presents NGP-RT, a novel approach for enhancing the rendering speed of Instant-NGP to achieve real-time novel view synthesis. As a classic NeRF-based method, Instant-NGP stores implicit features in multi-level grids or hash tables and applies a shallow MLP to convert the implicit features into explicit colors and densities. Although it achieves fast training speed, there is still a lot of room for improvement in its rendering speed due to the per-point MLP executions for implicit multi-level feature aggregation, especially for real-time applications. To address this challenge, our proposed NGP-RT explicitly stores colors and densities as hash features, and leverages a lightweight attention mechanism to disambiguate the hash collisions instead of using computationally intensive MLP. At the rendering stage, NGP-RT incorporates a pre-computed occupancy distance grid into the ray marching strategy to inform the distance to the nearest occupied voxel, thereby reducing the number of marching points and global memory access. Experimental results show that on the challenging Mip-NeRF360 dataset, NGP-RT achieves better rendering quality than previous NeRF-based methods, achieving 108 fps at 1080p resolution on a single Nvidia RTX 3090 GPU. Our approach is promising for NeRF-based real-time applications that require efficient and high-quality rendering. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.10431 [pdf, other]

Coport: A New Public Code for Polarized Radiative Transfer in a Covariant Framework$^\spadesuit$

Authors: Jiewei Huang, Liheng Zheng, Minyong Guo, Bin Chen

Abstract: General relativistic radiative transfer calculations are essential for comparing theoretical models of black hole accretion flows and jets with observational data. In this work, we introduce Coport, a novel public code specifically designed for covariant polarized ray-tracing radiative transfer computations in any spacetime. Written in Julia, Coport includes an interface for visualizing numerical… ▽ More General relativistic radiative transfer calculations are essential for comparing theoretical models of black hole accretion flows and jets with observational data. In this work, we introduce Coport, a novel public code specifically designed for covariant polarized ray-tracing radiative transfer computations in any spacetime. Written in Julia, Coport includes an interface for visualizing numerical results obtained from HARM, a publicly available implementation of the general relativistic magnetohydrodynamics code. We validate the precision of our code by comparing its outputs with the results from a variety of established methodologies. This includes the verification against analytical solutions, the validation through thin-disk assessments, and the evaluation via thick-disk analyses. Notably, our code employs a methodology that eliminates the need for separating the computations of spacetime propagation and plasma propagation. Instead, it directly solves the coupled, covariant, polarized radiative transfer equation in curved spacetime, seamlessly integrating the effects of gravity with plasma influences. This approach sets our code apart from the existing alternatives and enhances its accuracy and efficiency. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: 27 pages, 6 figures;

arXiv:2407.10339 [pdf, other]

Supernova Pointing Capabilities of DUNE

Authors: DUNE Collaboration, A. Abed Abud, B. Abi, R. Acciarri, M. A. Acero, M. R. Adames, G. Adamov, M. Adamowski, D. Adams, M. Adinolfi, C. Adriano, A. Aduszkiewicz, J. Aguilar, B. Aimard, F. Akbar, K. Allison, S. Alonso Monsalve, M. Alrashed, A. Alton, R. Alvarez, T. Alves, H. Amar, P. Amedo, J. Anderson, D. A. Andrade , et al. (1340 additional authors not shown)

Abstract: The determination of the direction of a stellar core collapse via its neutrino emission is crucial for the identification of the progenitor for a multimessenger follow-up. A highly effective method of reconstructing supernova directions within the Deep Underground Neutrino Experiment (DUNE) is introduced. The supernova neutrino pointing resolution is studied by simulating and reconstructing electr… ▽ More The determination of the direction of a stellar core collapse via its neutrino emission is crucial for the identification of the progenitor for a multimessenger follow-up. A highly effective method of reconstructing supernova directions within the Deep Underground Neutrino Experiment (DUNE) is introduced. The supernova neutrino pointing resolution is studied by simulating and reconstructing electron-neutrino charged-current absorption on $^{40}$Ar and elastic scattering of neutrinos on electrons. Procedures to reconstruct individual interactions, including a newly developed technique called ``brems flipping'', as well as the burst direction from an ensemble of interactions are described. Performance of the burst direction reconstruction is evaluated for supernovae happening at a distance of 10 kpc for a specific supernova burst flux model. The pointing resolution is found to be 3.4 degrees at 68% coverage for a perfect interaction-channel classification and a fiducial mass of 40 kton, and 6.6 degrees for a 10 kton fiducial mass respectively. Assuming a 4% rate of charged-current interactions being misidentified as elastic scattering, DUNE's burst pointing resolution is found to be 4.3 degrees (8.7 degrees) at 68% coverage. △ Less

Submitted 14 July, 2024; originally announced July 2024.

Comments: 25 pages, 16 figures

Report number: FERMILAB-PUB-24-0319-LBNF

arXiv:2407.10330 [pdf, other]

Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors

Authors: Jae Joong Lee, Bosheng Li, Sara Beery, Jonathan Huang, Songlin Fei, Raymond A. Yeh, Bedrich Benes

Abstract: We introduce Tree D-fusion, featuring the first collection of 600,000 environmentally aware, 3D simulation-ready tree models generated through Diffusion priors. Each reconstructed 3D tree model corresponds to an image from Google's Auto Arborist Dataset, comprising street view images and associated genus labels of trees across North America. Our method distills the scores of two tree-adapted diffu… ▽ More We introduce Tree D-fusion, featuring the first collection of 600,000 environmentally aware, 3D simulation-ready tree models generated through Diffusion priors. Each reconstructed 3D tree model corresponds to an image from Google's Auto Arborist Dataset, comprising street view images and associated genus labels of trees across North America. Our method distills the scores of two tree-adapted diffusion models by utilizing text prompts to specify a tree genus, thus facilitating shape reconstruction. This process involves reconstructing a 3D tree envelope filled with point markers, which are subsequently utilized to estimate the tree's branching structure using the space colonization algorithm conditioned on a specified genus. △ Less

Submitted 14 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV24

arXiv:2407.10207 [pdf, other]

Learning to Steer Markovian Agents under Model Uncertainty

Authors: Jiawei Huang, Vinzenz Thoma, Zebang Shen, Heinrich H. Nax, Niao He

Abstract: Designing incentives for an adapting population is a ubiquitous problem in a wide array of economic applications and beyond. In this work, we study how to design additional rewards to steer multi-agent systems towards desired policies \emph{without} prior knowledge of the agents' underlying learning dynamics. We introduce a model-based non-episodic Reinforcement Learning (RL) formulation for our s… ▽ More Designing incentives for an adapting population is a ubiquitous problem in a wide array of economic applications and beyond. In this work, we study how to design additional rewards to steer multi-agent systems towards desired policies \emph{without} prior knowledge of the agents' underlying learning dynamics. We introduce a model-based non-episodic Reinforcement Learning (RL) formulation for our steering problem. Importantly, we focus on learning a \emph{history-dependent} steering strategy to handle the inherent model uncertainty about the agents' learning dynamics. We introduce a novel objective function to encode the desiderata of achieving a good steering outcome with reasonable cost. Theoretically, we identify conditions for the existence of steering strategies to guide agents to the desired policies. Complementing our theoretical contributions, we provide empirical algorithms to approximately solve our objective, which effectively tackles the challenge in learning history-dependent strategies. We demonstrate the efficacy of our algorithms through empirical evaluations. △ Less

Submitted 14 July, 2024; originally announced July 2024.

Comments: 33 Pages

arXiv:2407.10155 [pdf, other]

A Nançay Radio Telescope study of the hyperactive repeating FRB 20220912A

Authors: David C. Konijn, Danté M. Hewitt, Jason W. T. Hessels, Ismaël Cognard, Jeff Huang, Omar S. Ould-Boukattine, Pragya Chawla, Kenzie Nimmo, Mark P. Snelders, Akshatha Gopinath, Ninisha Manaswini

Abstract: The repeating fast radio burst source FRB 20220912A was remarkably active in the weeks after its discovery. Here we report 696 bursts detected with the Nançay Radio Telescope (NRT) as part of the Extragalactic Coherent Light from Astrophysical Transients (ÉCLAT) monitoring campaign. We present 68 observations, conducted from October 2022 to April 2023, with a total duration of 61 hours and an even… ▽ More The repeating fast radio burst source FRB 20220912A was remarkably active in the weeks after its discovery. Here we report 696 bursts detected with the Nançay Radio Telescope (NRT) as part of the Extragalactic Coherent Light from Astrophysical Transients (ÉCLAT) monitoring campaign. We present 68 observations, conducted from October 2022 to April 2023, with a total duration of 61 hours and an event rate peaking at $75^{+10}_{-9}$ bursts per hour above a fluence threshold of 0.59 Jy ms in the $1.2-1.7$-GHz band. Most bursts in the sample occur towards the bottom of the observing band. They follow a bimodal wait-time distribution, with peaks at 33.4 ms and 67.0 s. We find a roughly constant dispersion measure (DM) over time ($δ$DM $\lesssim$ 2 pc cm$^{-3}$) when taking into account `sad-trombone' drift, with a mean drift rate of $-8.8 $MHz ms$^{-1}$. Nonetheless, we confirm small $\sim0.3$ pc cm$^{-3}$ DM variations using microshot structure, while finding that microstructure is rare in our sample -- despite the 16 $μ$s time resolution of the data. The cumulative spectral energy distribution shows more high-energy bursts ($E_ν\gtrsim 10^{31}$ erg/Hz) than would be expected from a simple power-law distribution. The burst rate per observation appears Poissonian, but the full set of observations is better modelled by a Weibull distribution, showing clustering. We discuss the various observational similarities that FRB 20220912A shares with other (hyper)active repeaters, which as a group are beginning to show a common set of phenomenological traits that provide multiple useful dimensions for their quantitative comparison and modelling. △ Less

Submitted 14 July, 2024; originally announced July 2024.

Comments: 18 pages, 20 figures

arXiv:2407.09932 [pdf, other]

Quantum Clock Synchronization Network with Silicon-chip Dual-Pumped Entangled Photon Source

Authors: J. A. Li, H. Han, X. P. Huang, B. Y. Tang, K. Guo, J. Q. Huang, S. Y. Xiong, W. R. Yu, Z. J. Zhang, J. B. Yang, B. Liu, H. Chen, Z. K. Lu

Abstract: In this paper, we propose a quantum clock synchronization (QCS) network scheme with silicon-chip dual-pumped entangled photon source. This scheme couples two pump beams into the silicon-based waveguide, where degenerate and non-degenerate spontaneous four-wave mixing (SFWM) occurs, generating entanglement between one signal channel and three idler channels. The entangled photons are distributed to… ▽ More In this paper, we propose a quantum clock synchronization (QCS) network scheme with silicon-chip dual-pumped entangled photon source. This scheme couples two pump beams into the silicon-based waveguide, where degenerate and non-degenerate spontaneous four-wave mixing (SFWM) occurs, generating entanglement between one signal channel and three idler channels. The entangled photons are distributed to remote users through the wavelength division multiplexing strategy to construct an entanglement distribution network, and the round-trip QCS is adopted to realize a QCS network that can serve multiple users. A proof-of-principle QCS network experiment is implemented among the server and multiple users (Alice, Bob, and Charlie) for 11.1 hours, where Alice and Charlie are 10 km away from the server and Bob is 25 km away from the server. The lowest time deviations (TDEV) between the server and each user (Alice, Bob, and Charlie) are 1.57 ps, 0.82 ps and 2.57 ps at the average time of 8000 s, 8000 s and 800 s respectively. The results show that the QCS network scheme with dual-pumped SFWM photon source proposed by us achieves high accuracy, and the channel resources used by n users are reduced by about 30% compared with other round-trip QCS schemes. △ Less

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.09709 [pdf, other]

GOFA: A Generative One-For-All Model for Joint Graph Language Modeling

Authors: Lecheng Kong, Jiarui Feng, Hao Liu, Chengsong Huang, Jiaxin Huang, Yixin Chen, Muhan Zhang

Abstract: Foundation models, such as Large Language Models (LLMs) or Large Vision Models (LVMs), have emerged as one of the most powerful tools in the respective fields. However, unlike text and image data, graph data do not have a definitive structure, posing great challenges to developing a Graph Foundation Model (GFM). For example, current attempts at designing general graph models either transform graph… ▽ More Foundation models, such as Large Language Models (LLMs) or Large Vision Models (LVMs), have emerged as one of the most powerful tools in the respective fields. However, unlike text and image data, graph data do not have a definitive structure, posing great challenges to developing a Graph Foundation Model (GFM). For example, current attempts at designing general graph models either transform graph data into a language format for LLM-based prediction or still train a GNN model with LLM as an assistant. The former can handle unlimited tasks, while the latter captures graph structure much better -- yet, no existing work can achieve both simultaneously. In this paper, we identify three key desirable properties of a GFM: self-supervised pretraining, fluidity in tasks, and graph awareness. To account for these properties, we extend the conventional language modeling to the graph domain and propose a novel generative graph language model GOFA to solve the problem. The model interleaves randomly initialized GNN layers into a frozen pre-trained LLM so that the semantic and structural modeling abilities are organically combined. GOFA is pre-trained on newly proposed graph-level next-word prediction, question-answering, and structural tasks to obtain the above GFM properties. The pre-trained model is further fine-tuned on downstream tasks to obtain task-solving ability. The fine-tuned model is evaluated on various downstream tasks, demonstrating a strong ability to solve structural and contextual problems in zero-shot scenarios. The code is available at https://github.com/JiaruiFeng/GOFA. △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.09684 [pdf, other]

Characteristics and Source Regions of Slow Alfvenic Solar Wind Observed by Parker Solar Probe

Authors: Tamar Ervin, Kai Jaffarove, Samuel T. Badman, Jia Huang, Yeimy J. Rivera, Stuart D. Bale

Abstract: Using a classification scheme for solar wind type based on the heliocentric distance of the observation, we look at near perihelion observations from Parker Solar Probe Encounters Four to Fourteen to study the sources of the slow Alfvenic solar wind (SASW). Through Potential Field Source Surface (PFSS) modeling and ballistic mapping, we connect streams to their solar source and find that a primary… ▽ More Using a classification scheme for solar wind type based on the heliocentric distance of the observation, we look at near perihelion observations from Parker Solar Probe Encounters Four to Fourteen to study the sources of the slow Alfvenic solar wind (SASW). Through Potential Field Source Surface (PFSS) modeling and ballistic mapping, we connect streams to their solar source and find that a primary population of SASW comes from small coronal holes (CHs) and their over-expanded boundaries, while a second population seems to emerge from non-CH structures potentially through interchange reconnection with nearby open field lines. This CH-like SASW shows larger expansion than the FSW, potentially indicating additional heating below the critical point leading to slower wind emerging from CH-like structures. We show that SASW emerging from CH-like structures shows stronger preferential acceleration of alpha particles (similar to the FSW) than the SASW emerging from non-CH structures, and that this is a velocity dependent phenomena as found in previous studies. To have additional confidence in our mapping results, we quantify the error on both the PFSS model and ballistic mapping and discuss how additional multi-point observations of plasma parameters and composition would allow us to better constrain our models and connect the solar wind to its source. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: 25 pages, 27 figures

arXiv:2407.09121 [pdf, other]

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

Authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu

Abstract: This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at a… ▽ More This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted using LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. Importantly, our approach successfully defends recent advanced attack methods (e.g., CodeAttack) that have jailbroken GPT-4 and LLaMA3-70B-Instruct. Our code and data can be found at https://github.com/RobustNLP/DeRTa. △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.09111 [pdf, other]

Inference Optimization of Foundation Models on AI Accelerators

Authors: Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis

Abstract: Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI across various industries. Industry and research community have witnessed a large number of new applications, based on those foundation models. Such applications include question and answer, customer services, image and video generation, and code completions… ▽ More Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI across various industries. Industry and research community have witnessed a large number of new applications, based on those foundation models. Such applications include question and answer, customer services, image and video generation, and code completions, among others. However, as the number of model parameters reaches to hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is ever more higher. To this end, our tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep learning system frameworks, we deep dive into system optimization techniques for fast and memory-efficient attention computations and discuss how they can be implemented efficiently on AI accelerators. Next, we describe architectural elements that are key for fast transformer inference. Finally, we examine various model compression and fast decoding strategies in the same context. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: Tutorial published at KDD 2024. Camera-ready version

arXiv:2407.08112 [pdf, other]

How Well Can a Long Sequence Model Model Long Sequences? Comparing Architechtural Inductive Biases on Long-Context Abilities

Authors: Jerry Huang

Abstract: Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous down-stream use-cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering as well as model design, have enabled the scaling up of model that are purported to support extended context length. In particular, the… ▽ More Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous down-stream use-cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering as well as model design, have enabled the scaling up of model that are purported to support extended context length. In particular, the state-space and linear recurrent neural network families of models hypothetically can entend to infinite sequence lenth. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: Work In Progress. 9 pages

arXiv:2407.06567 [pdf, other]

FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision Making

Authors: Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yupeng Cao, Zhi Chen, Jordan W. Suchow, Rong Liu, Zhenyu Cui, Denghui Zhang, Koduvayur Subbalakshmi, Guojun Xiong, Yueru He, Jimin Huang, Dong Li, Qianqian Xie

Abstract: Large language models (LLMs) have demonstrated notable potential in conducting complex tasks and are increasingly utilized in various financial applications. However, high-quality sequential financial investment decision-making remains challenging. These tasks require multiple interactions with a volatile environment for every decision, demanding sufficient intelligence to maximize returns and man… ▽ More Large language models (LLMs) have demonstrated notable potential in conducting complex tasks and are increasingly utilized in various financial applications. However, high-quality sequential financial investment decision-making remains challenging. These tasks require multiple interactions with a volatile environment for every decision, demanding sufficient intelligence to maximize returns and manage risks. Although LLMs have been used to develop agent systems that surpass human teams and yield impressive investment returns, opportunities to enhance multi-sourced information synthesis and optimize decision-making outcomes through timely experience refinement remain unexplored. Here, we introduce the FinCon, an LLM-based multi-agent framework with CONceptual verbal reinforcement tailored for diverse FINancial tasks. Inspired by effective real-world investment firm organizational structures, FinCon utilizes a manager-analyst communication hierarchy. This structure allows for synchronized cross-functional agent collaboration towards unified goals through natural language interactions and equips each agent with greater memory capacity than humans. Additionally, a risk-control component in FinCon enhances decision quality by episodically initiating a self-critiquing mechanism to update systematic investment beliefs. The conceptualized beliefs serve as verbal reinforcement for the future agent's behavior and can be selectively propagated to the appropriate node that requires knowledge updates. This feature significantly improves performance while reducing unnecessary peer-to-peer communication costs. Moreover, FinCon demonstrates strong generalization capabilities in various financial tasks, including single stock trading and portfolio management. △ Less

Submitted 10 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: LLM Applications, LLM Agents, Financial Technology, Quantitative Finance, Algorithmic Trading, Cognitive Science

arXiv:2407.06546 [pdf, other]

Exploring the Causality of End-to-End Autonomous Driving

Authors: Jiankun Li, Hao Li, Jiangjiang Liu, Zhikang Zou, Xiaoqing Ye, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

Abstract: Deep learning-based models are widely deployed in autonomous driving areas, especially the increasingly noticed end-to-end solutions. However, the black-box property of these models raises concerns about their trustworthiness and safety for autonomous driving, and how to debug the causality has become a pressing concern. Despite some existing research on the explainability of autonomous driving, t… ▽ More Deep learning-based models are widely deployed in autonomous driving areas, especially the increasingly noticed end-to-end solutions. However, the black-box property of these models raises concerns about their trustworthiness and safety for autonomous driving, and how to debug the causality has become a pressing concern. Despite some existing research on the explainability of autonomous driving, there is currently no systematic solution to help researchers debug and identify the key factors that lead to the final predicted action of end-to-end autonomous driving. In this work, we propose a comprehensive approach to explore and analyze the causality of end-to-end autonomous driving. First, we validate the essential information that the final planning depends on by using controlled variables and counterfactual interventions for qualitative analysis. Then, we quantitatively assess the factors influencing model decisions by visualizing and statistically analyzing the response of key model inputs. Finally, based on the comprehensive study of the multi-factorial end-to-end autonomous driving system, we have developed a strong baseline and a tool for exploring causality in the close-loop simulator CARLA. It leverages the essential input sources to obtain a well-designed model, resulting in highly competitive capabilities. As far as we know, our work is the first to unveil the mystery of end-to-end autonomous driving and turn the black box into a white one. Thorough close-loop experiments demonstrate that our method can be applied to end-to-end autonomous driving solutions for causality debugging. Code will be available at https://github.com/bdvisl/DriveInsight. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.06317 [pdf, other]

Enhanced Safety in Autonomous Driving: Integrating Latent State Diffusion Model for End-to-End Navigation

Authors: Detian Chu, Linyuan Bai, Jianuo Huang, Zhenlong Fang, Peng Zhang, Wei Kang

Abstract: With the advancement of autonomous driving, ensuring safety during motion planning and navigation is becoming more and more important. However, most end-to-end planning methods suffer from a lack of safety. This research addresses the safety issue in the control optimization problem of autonomous driving, formulated as Constrained Markov Decision Processes (CMDPs). We propose a novel, model-based… ▽ More With the advancement of autonomous driving, ensuring safety during motion planning and navigation is becoming more and more important. However, most end-to-end planning methods suffer from a lack of safety. This research addresses the safety issue in the control optimization problem of autonomous driving, formulated as Constrained Markov Decision Processes (CMDPs). We propose a novel, model-based approach for policy optimization, utilizing a conditional Value-at-Risk based Soft Actor Critic to manage constraints in complex, high-dimensional state spaces effectively. Our method introduces a worst-case actor to guide safe exploration, ensuring rigorous adherence to safety requirements even in unpredictable scenarios. The policy optimization employs the Augmented Lagrangian method and leverages latent diffusion models to predict and simulate future trajectories. This dual approach not only aids in navigating environments safely but also refines the policy's performance by integrating distribution modeling to account for environmental uncertainties. Empirical evaluations conducted in both simulated and real environment demonstrate that our approach outperforms existing methods in terms of safety, efficiency, and decision-making capabilities. △ Less

Submitted 16 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.06204 [pdf, other]

A Survey on Mixture of Experts

Authors: Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, Jiayi Huang

Abstract: Large language models (LLMs) have garnered unprecedented advancements across diverse fields, ranging from natural language processing to computer vision and beyond. The prowess of LLMs is underpinned by their substantial model size, extensive and diverse datasets, and the vast computational power harnessed during training, all of which contribute to the emergent abilities of LLMs (e.g., in-context… ▽ More Large language models (LLMs) have garnered unprecedented advancements across diverse fields, ranging from natural language processing to computer vision and beyond. The prowess of LLMs is underpinned by their substantial model size, extensive and diverse datasets, and the vast computational power harnessed during training, all of which contribute to the emergent abilities of LLMs (e.g., in-context learning) that are not present in small models. Within this context, the mixture of experts (MoE) has emerged as an effective method for substantially scaling up model capacity with minimal computation overhead, gaining significant attention from academia and industry. Despite its growing prevalence, there lacks a systematic and comprehensive review of the literature on MoE. This survey seeks to bridge that gap, serving as an essential resource for researchers delving into the intricacies of MoE. We first briefly introduce the structure of the MoE layer, followed by proposing a new taxonomy of MoE. Next, we overview the core designs for various MoE models including both algorithmic and systemic aspects, alongside collections of available open-source implementations, hyperparameter configurations and empirical evaluations. Furthermore, we delineate the multifaceted applications of MoE in practice, and outline some potential directions for future research. To facilitate ongoing updates and the sharing of cutting-edge developments in MoE research, we have established a resource repository accessible at https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts. △ Less

Submitted 26 June, 2024; originally announced July 2024.

arXiv:2407.05749 [pdf, other]

LDGCN: An Edge-End Lightweight Dual GCN Based on Single-Channel EEG for Driver Drowsiness Monitoring

Authors: Jingwei Huang, Chuansheng Wang, Jiayan Huang, Haoyi Fan, Antoni Grau, Fuquan Zhang

Abstract: Driver drowsiness electroencephalography (EEG) signal monitoring can timely alert drivers of their drowsiness status, thereby reducing the probability of traffic accidents. Graph convolutional networks (GCNs) have shown significant advancements in processing the non-stationary, time-varying, and non-Euclidean nature of EEG signals. However, the existing single-channel EEG adjacency graph construct… ▽ More Driver drowsiness electroencephalography (EEG) signal monitoring can timely alert drivers of their drowsiness status, thereby reducing the probability of traffic accidents. Graph convolutional networks (GCNs) have shown significant advancements in processing the non-stationary, time-varying, and non-Euclidean nature of EEG signals. However, the existing single-channel EEG adjacency graph construction process lacks interpretability, which hinders the ability of GCNs to effectively extract adjacency graph features, thus affecting the performance of drowsiness monitoring. To address this issue, we propose an edge-end lightweight dual graph convolutional network (LDGCN). Specifically, we are the first to incorporate neurophysiological knowledge to design a Baseline Drowsiness Status Adjacency Graph (BDSAG), which characterizes driver drowsiness status. Additionally, to express more features within limited EEG data, we introduce the Augmented Graph-level Module (AGM). This module captures global and local information at the graph level, ensuring that BDSAG features remain intact while enhancing effective feature expression capability. Furthermore, to deploy our method on the fourth-generation Raspberry Pi, we utilize Adaptive Pruning Optimization (APO) on both channels and neurons, reducing inference latency by almost half. Experiments on benchmark datasets demonstrate that LDGCN offers the best trade-off between monitoring performance and hardware resource utilization compared to existing state-of-the-art algorithms. All our source code can be found at https://github.com/BryantDom/Driver-Drowsiness-Monitoring. △ Less

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.05679 [pdf, other]

BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

Authors: Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

Abstract: World models are receiving increasing attention in autonomous driving for their ability to predict potential future scenarios. In this paper, we present BEVWorld, a novel approach that tokenizes multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for environment modeling. The world model consists of two parts: the multi-modal tokenizer and the latent BEV sequence… ▽ More World models are receiving increasing attention in autonomous driving for their ability to predict potential future scenarios. In this paper, we present BEVWorld, a novel approach that tokenizes multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for environment modeling. The world model consists of two parts: the multi-modal tokenizer and the latent BEV sequence diffusion model. The multi-modal tokenizer first encodes multi-modality information and the decoder is able to reconstruct the latent BEV tokens into LiDAR and image observations by ray-casting rendering in a self-supervised manner. Then the latent BEV sequence diffusion model predicts future scenarios given action tokens as conditions. Experiments demonstrate the effectiveness of BEVWorld in autonomous driving tasks, showcasing its capability in generating future scenes and benefiting downstream tasks such as perception and motion prediction. Code will be available at https://github.com/zympsyche/BevWorld. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: 10 pages

arXiv:2407.05573 [pdf]

Spatio-Temporal Encoding and Decoding-Based Method for Future Human Activity Skeleton Synthesis

Authors: Tingyu Liu, Jun Huang, Chenyi Weng

Abstract: Inferring future activity information based on observed activity data is a crucial step to improve the accuracy of early activity prediction. Traditional methods based on generative adversarial networks(GAN) or joint learning frameworks can achieve good prediction accuracy under low observation ratios, but they usually have high computational costs. In view of this, this paper proposes a spatio-te… ▽ More Inferring future activity information based on observed activity data is a crucial step to improve the accuracy of early activity prediction. Traditional methods based on generative adversarial networks(GAN) or joint learning frameworks can achieve good prediction accuracy under low observation ratios, but they usually have high computational costs. In view of this, this paper proposes a spatio-temporal encoding and decoding-based method for future human activity skeleton synthesis. Firstly, algorithms such as time control, discrete cosine transform, and low-pass filtering are used to cut or pad the skeleton sequences. Secondly, the encoder and decoder are responsible for extracting intermediate semantic encoding from observed skeleton sequences and inferring future sequences from the intermediate semantic encoding, respectively. Finally, joint displacement error, velocity error, and acceleration error, three higher-order kinematic features, are used as key components of the loss function to optimize model parameters. Experimental results show that the proposed future skeleton synthesis algorithm performs better than some existing algorithms. It generates skeleton sequences with smaller errors and fewer model parameters, effectively providing future information for early activity prediction. △ Less

Submitted 7 July, 2024; originally announced July 2024.

arXiv:2407.05350 [pdf]

Multiple boundary states in bilayer and decorated Su-Schrieffer-Heeger-like models

Authors: Shengqun Guo, Jinke Huang, Ruimin Huang, Fengjiang Zhuang, Zhili Lin, Weibin Qiu

Abstract: Topological boundary states have attracted widespread fascination due to their series of intriguing properties. In this paper, we investigate the multiple boundary states within the two kinds of extended Su-Schrieffer-Heeger (SSH) models. The coexistence of boundary states that exist both in the bulk and band gaps is realized based on the bilayer SSH-like model, which consists of two conventional… ▽ More Topological boundary states have attracted widespread fascination due to their series of intriguing properties. In this paper, we investigate the multiple boundary states within the two kinds of extended Su-Schrieffer-Heeger (SSH) models. The coexistence of boundary states that exist both in the bulk and band gaps is realized based on the bilayer SSH-like model, which consists of two conventional square-root SSH models that are directly coupled. We further show the square-root topology within the decorated SSH-like model, which supports multiple boundary states that could be embedded into the bulk continuum by tuning the hopping parameters. In addition, the connection between the decorated SSH-like model and its effectively decomposed counterparts is revealed. Our results broaden insight into the multiple boundary states and open up an exciting avenue for the future exploration of square-root topology. △ Less

Submitted 7 July, 2024; originally announced July 2024.

arXiv:2407.04932 [pdf, other]

Helicity-changing Decays of Cosmological Relic Neutrinos

Authors: Jihong Huang, Shun Zhou

Abstract: In this paper, we examine the possibility that massive neutrinos are unstable due to their invisible decays $ν^{}_i \to ν^{}_j + φ$, where $ν^{}_i$ and $ν^{}_j$ (for $i, j = 1, 2, 3$) are any two of neutrino mass eigenstates with masses $m^{}_i > m^{}_j$ and $φ$ is a massless Nambu-Goldstone boson, and explore the implications for the detection of cosmological relic neutrinos in the present Univer… ▽ More In this paper, we examine the possibility that massive neutrinos are unstable due to their invisible decays $ν^{}_i \to ν^{}_j + φ$, where $ν^{}_i$ and $ν^{}_j$ (for $i, j = 1, 2, 3$) are any two of neutrino mass eigenstates with masses $m^{}_i > m^{}_j$ and $φ$ is a massless Nambu-Goldstone boson, and explore the implications for the detection of cosmological relic neutrinos in the present Universe. First, we carry out a complete calculation of neutrino decay rates in the general case where the individual helicities of parent and daughter neutrinos are specified. Then, the invisible decays of cosmological relic neutrinos are studied and their impact on the capture rates on the beta-decaying nuclei (e.g., $ν^{}_e + {^3{\rm H}} \to {^3{\rm He}} + e^-$) is analyzed. The invisible decays of massive neutrinos could substantially change the capture rates in the PTOLEMY-like experiments when compared to the case of stable neutrinos. In particular, we find that the helicity-changing decays of Dirac neutrinos play an important role whereas those of Majorana neutrinos have no practical effects. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: 28 pages, 7 figures

arXiv:2407.04069 [pdf, other]

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Authors: Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang

Abstract: Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the comple… ▽ More Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.03202 [pdf, other]

Clifford Circuits Augmented Time-Dependent Variational Principle

Authors: Xiangjian Qian, Jiale Huang, Mingpu Qin

Abstract: The recently proposed Clifford Circuits Augmented Matrix Product States (CA-MPS) (arXiv:2405.09217) seamlessly augments Density Matrix Renormalization Group with Clifford circuits. In CA-MPS, the entanglement from stabilizers is transferred to the Clifford circuits which can be easily handled according to the Gottesman-Knill theorem. As a result, MPS needs only to deal with the non-stabilizer enta… ▽ More The recently proposed Clifford Circuits Augmented Matrix Product States (CA-MPS) (arXiv:2405.09217) seamlessly augments Density Matrix Renormalization Group with Clifford circuits. In CA-MPS, the entanglement from stabilizers is transferred to the Clifford circuits which can be easily handled according to the Gottesman-Knill theorem. As a result, MPS needs only to deal with the non-stabilizer entanglement, which largely reduce the bond dimension and the resource required for the accurate simulation of many-body systems. In this work, we generalize CA-MPS to the framework of Time-Dependent Variational Principle (TDVP) for time evolution simulations. In this method, we apply Clifford circuits to the resulting MPS in each TDVP step with a two-site sweeping process similar as in DMRG, aiming at reducing the entanglement entropy in the MPS, and the Hamiltonian is transformed accordingly using the chosen Clifford circuits. Similar as in CA-MPS, the Clifford circuits doesn't increase the number of terms in the Hamiltonian which makes the overhead very small in the new method. We test this method in both XXZ chain and two dimensional Heisenberg model. The results show that the Clifford circuits augmented TDVP method can reduce the entanglement entropy in the time evolution process and hence makes the simulation reliable for longer time. The Clifford circuits augmented Time-Dependent Variational Principle provides a useful tool for the simulation of time evolution process of many-body systems in the future. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.02973 [pdf, other]

NOEMA formIng Cluster survEy (NICE): Characterizing eight massive galaxy groups at $1.5 < z < 4$ in the COSMOS field

Authors: Nikolaj B. Sillassen, Shuowen Jin, Georgios E. Magdis, Emanuele Daddi, Tao Wang, Shiying Lu, Hanwen Sun, Vinod Arumugam, Daizhong Liu, Malte Brinch, Chiara D'Eugenio, Raphael Gobat, Carlos Gómez-Guijarro, Michael Rich, Eva Schinnerer, Veronica Strazzullo, Qinghua Tan, Francesco Valentino, Yijun Wang, Mengyuan Xiao, Luwenjia Zhou, David Blánquez-Sesé, Zheng Cai, Yanmei Chen, Laure Ciesla , et al. (19 additional authors not shown)

Abstract: The NOEMA formIng Cluster survEy (NICE) is a large program targeting 69 massive galaxy group candidates at $z>2$ in six deep fields. We report spectroscopic confirmation of eight groups at $1.65\leq z\leq3.61$ in COSMOS. Homogeneously selected as significant overdensities of red IRAC sources with red Herschel colors, four groups are confirmed by CO and [CI] with NOEMA 3mm observations, three are c… ▽ More The NOEMA formIng Cluster survEy (NICE) is a large program targeting 69 massive galaxy group candidates at $z>2$ in six deep fields. We report spectroscopic confirmation of eight groups at $1.65\leq z\leq3.61$ in COSMOS. Homogeneously selected as significant overdensities of red IRAC sources with red Herschel colors, four groups are confirmed by CO and [CI] with NOEMA 3mm observations, three are confirmed with ALMA, and one is confirmed by H$α$ from Subaru/FMOS. We constructed the integrated FIR SEDs for the eight groups, obtaining total IR SFR $=260-1300~{\rm M_\odot}$~yr$^{-1}$. We adopted six methods to estimate the dark matter masses, including stellar mass to halo mass relations, overdensity with galaxy bias, and NFW profile fitting to radial stellar mass density. We found the radial stellar mass density are consistent with a NFW profile, supporting that they are collapsed structures hosted by a single dark matter halo. The best halo mass estimates are $\log(M_{\rm h}/{\rm M_\odot})=12.8-13.7$ with uncertainty of 0.3 dex. From halo mass estimates, we derive baryonic accretion rate ${\rm BAR}=(1-8)\times10^{3}\,{\rm M_{\odot}/yr}$ for this sample. We find a quasi-linear correlation between the integrated SFR/BAR and the theoretical halo mass limit for cold streams, $M_{\rm stream}/M_{\rm h}$, with ${\rm SFR/BAR}=10^{-0.46\pm0.22}\left({M_{\rm stream}/M_{\rm h}}\right)^{0.71\pm0.16}$ with a scatter of $0.40\,{\rm dex}$. Further, we compare halo masses and stellar masses with simulations, and find all structures are consistent with being progenitors of $M_{\rm h}(z=0)>10^{14}\,{\rm M_{\odot}}$ galaxy clusters, and the most massive central galaxies have stellar masses consistent with brightest cluster galaxies (BCGs) progenitors in the TNG300 simulation. The results strongly suggest these structures are forming massive galaxy clusters via baryonic and dark matter accretion. △ Less

Submitted 5 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

Comments: 44 pages (27pp appendix), 32 figures, 18 tables, accepted for publication in A&A

arXiv:2407.02803 [pdf, other]

KnobCF: Uncertainty-aware Knob Tuning

Authors: Yu Yan, Junfang Huang, Hongzhi Wang, Jian Geng, Kaixin Zhang, Tao Yu

Abstract: The knob tuning aims to optimize database performance by searching for the most effective knob configuration under a certain workload. Existing works suffer two significant problems. On the one hand, there exist multiple similar even useless evaluations of knob tuning even with the diverse searching methods because of the different sensitivities of knobs on a certain workload. On the other hand, t… ▽ More The knob tuning aims to optimize database performance by searching for the most effective knob configuration under a certain workload. Existing works suffer two significant problems. On the one hand, there exist multiple similar even useless evaluations of knob tuning even with the diverse searching methods because of the different sensitivities of knobs on a certain workload. On the other hand, the single evaluation of knob configurations may bring overestimation or underestimation because of the query uncertainty performance. To solve the above problems, we propose a decoupled query uncertainty-aware knob classifier, called KnobCF, to enhance the knob tuning. Our method has three significant contributions: (1) We propose a novel concept of the uncertainty-aware knob configuration estimation to enhance the knob tuning process. (2) We provide an effective few-shot uncertainty knob estimator without extra time consumption in training data collection, which has a high time efficiency in practical tuning tasks. (3) Our method provides a general framework that could be easily deployed in any knob tuning task because we make no changes to the knob tuners and the database management system. Our experiments on four open-source benchmarks demonstrate that our method effectively reduces useless evaluations and improves the tuning results. Especially in TPCC, our method achieves competitive tuning results with only 60% to 70% time consumption compared to the full workload evaluations. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.01781 [pdf, other]

doi 10.1145/3658226

fVDB: A Deep-Learning Framework for Sparse, Large-Scale, and High-Performance Spatial Intelligence

Authors: Francis Williams, Jiahui Huang, Jonathan Swartz, Gergely Klár, Vijay Thakkar, Matthew Cong, Xuanchi Ren, Ruilong Li, Clement Fuji-Tsang, Sanja Fidler, Eftychios Sifakis, Ken Museth

Abstract: We present fVDB, a novel GPU-optimized framework for deep learning on large-scale 3D data. fVDB provides a complete set of differentiable primitives to build deep learning architectures for common tasks in 3D learning such as convolution, pooling, attention, ray-tracing, meshing, etc. fVDB simultaneously provides a much larger feature set (primitives and operators) than established frameworks wi… ▽ More We present fVDB, a novel GPU-optimized framework for deep learning on large-scale 3D data. fVDB provides a complete set of differentiable primitives to build deep learning architectures for common tasks in 3D learning such as convolution, pooling, attention, ray-tracing, meshing, etc. fVDB simultaneously provides a much larger feature set (primitives and operators) than established frameworks with no loss in efficiency: our operators match or exceed the performance of other frameworks with narrower scope. Furthermore, fVDB can process datasets with much larger footprint and spatial resolution than prior works, while providing a competitive memory footprint on small inputs. To achieve this combination of versatility and performance, fVDB relies on a single novel VDB index grid acceleration structure paired with several key innovations including GPU accelerated sparse grid construction, convolution using tensorcores, fast ray tracing kernels using a Hierarchical Digital Differential Analyzer algorithm (HDDA), and jagged tensors. Our framework is fully integrated with PyTorch enabling interoperability with existing pipelines, and we demonstrate its effectiveness on a number of representative tasks such as large-scale point-cloud segmentation, high resolution 3D generative modeling, unbounded scale Neural Radiance Fields, and large-scale point cloud reconstruction. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01679 [pdf, other]

Constraints on the gas-phase C/O ratio of DR Tau's outer disk from CS, SO, and C$_2$H observations

Authors: Jane Huang, Edwin A. Bergin, Romane Le Gal, Sean M. Andrews, Jaehan Bae, Luke Keyte, J. A. Sturm

Abstract: Millimeter wavelength observations of Class II protoplanetary disks often display strong emission from hydrocarbons and high CS/SO values, providing evidence that the gas-phase C/O ratio commonly exceeds 1 in their outer regions. We present new NOEMA observations of CS $5-4$, SO $7_6-6_5$ and $5_6-4_5$, C$_2$H $N=3-2$, HCN $3-2$, HCO$^+$ $3-2$, and H$^{13}$CO$^+$ $3-2$ in the DR Tau protoplanetary… ▽ More Millimeter wavelength observations of Class II protoplanetary disks often display strong emission from hydrocarbons and high CS/SO values, providing evidence that the gas-phase C/O ratio commonly exceeds 1 in their outer regions. We present new NOEMA observations of CS $5-4$, SO $7_6-6_5$ and $5_6-4_5$, C$_2$H $N=3-2$, HCN $3-2$, HCO$^+$ $3-2$, and H$^{13}$CO$^+$ $3-2$ in the DR Tau protoplanetary disk at a resolution of $\sim0.4''$ (80 au). Estimates for the disk-averaged CS/SO ratio range from $\sim0.4-0.5$, the lowest value reported thus far for a T Tauri disk. At a projected separation of $\sim180$ au northeast of the star, the SO moment maps exhibit a clump that has no counterpart in the other lines, and the CS/SO value decreases to $<0.2$ at its location. Thermochemical models calculated with DALI indicate that DR Tau's low CS/SO ratio and faint C$_2$H emission can be explained by a gas-phase C/O ratio that is $<1$ at the disk radii traced by NOEMA. Comparisons of DR Tau's SO emission to maps of extended structures traced by $^{13}$CO suggest that late infall may contribute to driving down the gas-phase C/O ratio of its disk. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: Accepted by ApJ

arXiv:2407.01654 [pdf, other]

A thermodynamically consistent phase-field lattice Boltzmann method for two-phase electrohydrodynamic flows

Authors: Fang Xiong, Lei Wang, Jiangxu Huang, Kang Luo

Abstract: In this work, we aim to develop a phase-field based lattice Boltzmann (LB) method for simulating two-phase electrohydrodynamics (EHD) flows, which allows for different properties (densities, viscosities, conductivity and permittivity) of each phase while maintaining thermodynamic consistency. To this end, we first present a theoretical analysis on the two-phase EHD flows by using the Onsager's var… ▽ More In this work, we aim to develop a phase-field based lattice Boltzmann (LB) method for simulating two-phase electrohydrodynamics (EHD) flows, which allows for different properties (densities, viscosities, conductivity and permittivity) of each phase while maintaining thermodynamic consistency. To this end, we first present a theoretical analysis on the two-phase EHD flows by using the Onsager's variational principle, which is an extension of Rayleigh's principle of least energy dissipation and, naturally, guarantees thermodynamic consistency. It shows that the governing equations of the model include the hydrodynamic equations, Cahn-Hilliard equation coupled with additional electrical effect, and the full Poisson-Nernst-Planck electrokinetic equations. After that, a coupled lattice Boltzmann (LB) scheme is constructed for simulating two-phase EHD flows. In particular, in order to handle two-phase EHD flows with a relatively larger electric permittivity ratio, we also introduce a delicately designed discrete forcing term into the LB equation for electrostatic field. Moreover, some numerical examples including two-phase EHD flows in planar layers and charge diffusion of a Gaussian bell are simulated with the developed LB method. It is shown that our numerical scheme shares a second-order convergence rate in space in predicting electric potential and charge density. Finally, we used the current model to simulate the deformation of a droplet under an electric field and the dynamics of droplet detachment in reversed electrowetting. Our numerical results align well with the theoretic solutions, and the available experimental/numerical data, demonstrating that the proposed method is feasible for simulating two-phase EHD flows. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01541 [pdf]

Integration of Computer Networks and Artificial Neural Networks for an AI-based Network Operator

Authors: Binbin Wu, Jingyu Xu, Yifan Zhang, Bo Liu, Yulu Gong, Jiaxin Huang

Abstract: This paper proposes an integrated approach combining computer networks and artificial neural networks to construct an intelligent network operator, functioning as an AI model. State information from computer networks is transformed into embedded vectors, enabling the operator to efficiently recognize different pieces of information and accurately output appropriate operations for the computer netw… ▽ More This paper proposes an integrated approach combining computer networks and artificial neural networks to construct an intelligent network operator, functioning as an AI model. State information from computer networks is transformed into embedded vectors, enabling the operator to efficiently recognize different pieces of information and accurately output appropriate operations for the computer network at each step. The operator has undergone comprehensive testing, achieving a 100% accuracy rate, thus eliminating operational risks. Furthermore, a novel algorithm is proposed to emphasize crucial training losses, aiming to enhance the efficiency of operator training. Additionally, a simple computer network simulator is created and encapsulated into training and testing environment components, enabling automation of the data collection, training, and testing processes. This abstract outlines the core contributions of the paper while highlighting the innovative methodology employed in the development and validation of the AI-based network operator. △ Less

Submitted 9 April, 2024; originally announced July 2024.

arXiv:2407.01312 [pdf, other]

ToCoAD: Two-Stage Contrastive Learning for Industrial Anomaly Detection

Authors: Yun Liang, Zhiguang Hu, Junjie Huang, Donglin Di, Anyang Su, Lei Fan

Abstract: Current unsupervised anomaly detection approaches perform well on public datasets but struggle with specific anomaly types due to the domain gap between pre-trained feature extractors and target-specific domains. To tackle this issue, this paper presents a two-stage training strategy, called \textbf{ToCoAD}. In the first stage, a discriminative network is trained by using synthetic anomalies in a… ▽ More Current unsupervised anomaly detection approaches perform well on public datasets but struggle with specific anomaly types due to the domain gap between pre-trained feature extractors and target-specific domains. To tackle this issue, this paper presents a two-stage training strategy, called \textbf{ToCoAD}. In the first stage, a discriminative network is trained by using synthetic anomalies in a self-supervised learning manner. This network is then utilized in the second stage to provide a negative feature guide, aiding in the training of the feature extractor through bootstrap contrastive learning. This approach enables the model to progressively learn the distribution of anomalies specific to industrial datasets, effectively enhancing its generalizability to various types of anomalies. Extensive experiments are conducted to demonstrate the effectiveness of our proposed two-stage training strategy, and our model produces competitive performance, achieving pixel-level AUROC scores of 98.21\%, 98.43\% and 97.70\% on MVTec AD, VisA and BTAD respectively. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: 11 pages, 7 figures

arXiv:2407.01015 [pdf, other]

Bayesian Entropy Neural Networks for Physics-Aware Prediction

Authors: Rahul Rathnakumar, Jiayu Huang, Hao Yan, Yongming Liu

Abstract: This paper addresses the need for deep learning models to integrate well-defined constraints into their outputs, driven by their application in surrogate models, learning with limited data and partial information, and scenarios requiring flexible model behavior to incorporate non-data sample information. We introduce Bayesian Entropy Neural Networks (BENN), a framework grounded in Maximum Entropy… ▽ More This paper addresses the need for deep learning models to integrate well-defined constraints into their outputs, driven by their application in surrogate models, learning with limited data and partial information, and scenarios requiring flexible model behavior to incorporate non-data sample information. We introduce Bayesian Entropy Neural Networks (BENN), a framework grounded in Maximum Entropy (MaxEnt) principles, designed to impose constraints on Bayesian Neural Network (BNN) predictions. BENN is capable of constraining not only the predicted values but also their derivatives and variances, ensuring a more robust and reliable model output. To achieve simultaneous uncertainty quantification and constraint satisfaction, we employ the method of multipliers approach. This allows for the concurrent estimation of neural network parameters and the Lagrangian multipliers associated with the constraints. Our experiments, spanning diverse applications such as beam deflection modeling and microstructure generation, demonstrate the effectiveness of BENN. The results highlight significant improvements over traditional BNNs and showcase competitive performance relative to contemporary constrained deep learning methods. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: 15 pages

ACM Class: I.5.1

arXiv:2407.00741 [pdf, other]

Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints

Authors: Jianuo Huang

Abstract: In recent advancements in Multi-agent Reinforcement Learning (MARL), its application has extended to various safety-critical scenarios. However, most methods focus on online learning, which presents substantial risks when deployed in real-world settings. Addressing this challenge, we introduce an innovative framework integrating diffusion models within the MARL paradigm. This approach notably enha… ▽ More In recent advancements in Multi-agent Reinforcement Learning (MARL), its application has extended to various safety-critical scenarios. However, most methods focus on online learning, which presents substantial risks when deployed in real-world settings. Addressing this challenge, we introduce an innovative framework integrating diffusion models within the MARL paradigm. This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action. Our framework is grounded in the Centralized Training with Decentralized Execution (CTDE) architecture, augmented by a Diffusion Model for prediction trajectory generation. Additionally, we incorporate a specialized algorithm to further ensure operational safety. We evaluate our model against baselines on the DSRL benchmark. Experiment results demonstrate that our model not only adheres to stringent safety constraints but also achieves superior performance compared to existing methodologies. This underscores the potential of our approach in advancing the safety and efficacy of MARL in real-world applications. △ Less

Submitted 15 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00468 [pdf, other]

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang

Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial p… ▽ More Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises $2,138$ question triplets, totaling $6,414$ distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by $31.73\%$, compared to an average gap of $8.03\%$ in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by $23.09\%$, whereas the gap for previous benchmarks is just $14.64\%$). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research. △ Less

Submitted 29 June, 2024; originally announced July 2024.

Comments: 21 pages, code released at https://github.com/chenllliang/MMEvalPro, Homepage at https://mmevalpro.github.io/

arXiv:2407.00431 [pdf, other]

Location embedding based pairwise distance learning for fine-grained diagnosis of urinary stones

Authors: Qiangguo Jin, Jiapeng Huang, Changming Sun, Hui Cui, Ping Xuan, Ran Su, Leyi Wei, Yu-Jie Wu, Chia-An Wu, Henry B. L. Duh, Yueh-Hsun Lu

Abstract: The precise diagnosis of urinary stones is crucial for devising effective treatment strategies. The diagnostic process, however, is often complicated by the low contrast between stones and surrounding tissues, as well as the variability in stone locations across different patients. To address this issue, we propose a novel location embedding based pairwise distance learning network (LEPD-Net) that… ▽ More The precise diagnosis of urinary stones is crucial for devising effective treatment strategies. The diagnostic process, however, is often complicated by the low contrast between stones and surrounding tissues, as well as the variability in stone locations across different patients. To address this issue, we propose a novel location embedding based pairwise distance learning network (LEPD-Net) that leverages low-dose abdominal X-ray imaging combined with location information for the fine-grained diagnosis of urinary stones. LEPD-Net enhances the representation of stone-related features through context-aware region enhancement, incorporates critical location knowledge via stone location embedding, and achieves recognition of fine-grained objects with our innovative fine-grained pairwise distance learning. Additionally, we have established an in-house dataset on urinary tract stones to demonstrate the effectiveness of our proposed approach. Comprehensive experiments conducted on this dataset reveal that our framework significantly surpasses existing state-of-the-art methods. △ Less

Submitted 29 June, 2024; originally announced July 2024.

Journal ref: MICCAI 2024

arXiv:2406.18657 [pdf, other]

Exploring the Complex Ionization Environment of the Turbulent DM Tau Disk

Authors: Deryl E. Long, L. Ilsedore Cleeves, Fred C. Adams, Sean Andrews, Edwin A. Bergin, Viviana V. Guzmán, Jane Huang, A. Meredith Hughes, Chunhua Qi, Kamber Schwarz, Jacob B. Simon, David Wilner

Abstract: Ionization drives important chemical and dynamical processes within protoplanetary disks, including the formation of organics and water in the cold midplane and the transportation of material via accretion and magneto-hydrodynamic (MHD) flows. Understanding these ionization-driven processes is crucial for understanding disk evolution and planet formation. We use new and archival ALMA observations… ▽ More Ionization drives important chemical and dynamical processes within protoplanetary disks, including the formation of organics and water in the cold midplane and the transportation of material via accretion and magneto-hydrodynamic (MHD) flows. Understanding these ionization-driven processes is crucial for understanding disk evolution and planet formation. We use new and archival ALMA observations of HCO+, H13CO+, and N2H+ to produce the first forward-modeled 2D ionization constraints for the DM Tau protoplanetary disk. We include ionization from multiple sources and explore the disk chemistry under a range of ionizing conditions. Abundances from our 2D chemical models are post-processed using non-LTE radiative transfer, visibility sampling, and imaging, and are compared directly to the observed radial emission profiles. The observations are best fit by a modestly reduced CR ionization rate ($ζ_{CR}$ ~ 10$^{-18}$ s$^{-1}$) and a hard X-ray spectrum (hardness ratio [HR] = 0.3), which we associate with stellar flaring conditions. Our best-fit model under-produces emission in the inner disk, suggesting that there may be an additional mechanism enhancing ionization in DM Tau's inner disk. Overall, our findings highlight the complexity of ionization in protoplanetary disks and the need for high resolution multi-line studies. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: 18 pages, 12 figures, accepted to be published in The Astrophysical Journal (June 25, 2024)

arXiv:2406.18522 [pdf, other]

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

Authors: Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan

Abstract: We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos wi… ▽ More We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model's capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on the ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: 31 pages, 15 figures

arXiv:2406.18139 [pdf, other]

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Authors: Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan

Abstract: Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temp… ▽ More Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships and related textual contexts. The predominance of image tokens means traditional optimizations for LLMs' KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes more textual attention over image features, and based on the multimodal interaction observation, a new proposed text-prior method is explored to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies using KV pairs merging. LOOK-M demonstrates that with a significant reduction in KV Cache memory usage, such as reducing it by 80% in some cases, it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long context multimodal tasks. △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.17880 [pdf, other]

MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval

Authors: Weitong Cai, Jiabo Huang, Shaogang Gong, Hailin Jin, Yang Liu

Abstract: Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable porti… ▽ More Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. It confines the cross-modal alignment knowledge within the scope of a limited text corpus, thereby leading to sub-optimal visual-textual modeling and poor generalizability. By leveraging the visual-textual understanding capability of multi-modal large language models (MLLM), in this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization. To effectively maintain temporal sensibility for localization, we design to get text narratives for each certain video timestamp and construct a structured text paragraph with time information, which is temporally aligned with the visual content. Then we perform cross-modal feature merging between the temporal-aware narratives and corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextual cohesive descriptions for improved retrieval. Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method. △ Less