subscribe to arXiv mailings

High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching

Authors: Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra

Abstract: We introduce a simple and efficient text-controllable high-fidelity music generation and editing model. It operates on sequences of continuous latent representations from a low frame rate 48 kHz stereo variational auto encoder codec that eliminates the information loss drawback of discrete representations. Based on a diffusion transformer architecture trained on a flow-matching objective the model… ▽ More We introduce a simple and efficient text-controllable high-fidelity music generation and editing model. It operates on sequences of continuous latent representations from a low frame rate 48 kHz stereo variational auto encoder codec that eliminates the information loss drawback of discrete representations. Based on a diffusion transformer architecture trained on a flow-matching objective the model can generate and edit diverse high quality stereo samples of variable duration, with simple text descriptions. We also explore a new regularized latent inversion method for zero-shot test-time text-guided editing and demonstrate its superior performance over naive denoising diffusion implicit model (DDIM) inversion for variety of music editing prompts. Evaluations are conducted on both objective and subjective metrics and demonstrate that the proposed model is not only competitive to the evaluated baselines on a standard text-to-music benchmark - quality and efficiency-wise - but also outperforms previous state of the art for music editing when combined with our proposed latent inversion. Samples are available at https://melodyflow.github.io. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.15007 [pdf, other]

RouteFinder: Towards Foundation Models for Vehicle Routing Problems

Authors: Federico Berto, Chuanbo Hua, Nayeli Gast Zepeda, André Hottung, Niels Wouda, Leon Lan, Kevin Tierney, Jinkyoo Park

Abstract: Vehicle Routing Problems (VRPs) are optimization problems with significant real-world implications in logistics, transportation, and supply chain management. Despite the recent progress made in learning to solve individual VRP variants, there is a lack of a unified approach that can effectively tackle a wide range of tasks, which is crucial for real-world impact. This paper introduces RouteFinder,… ▽ More Vehicle Routing Problems (VRPs) are optimization problems with significant real-world implications in logistics, transportation, and supply chain management. Despite the recent progress made in learning to solve individual VRP variants, there is a lack of a unified approach that can effectively tackle a wide range of tasks, which is crucial for real-world impact. This paper introduces RouteFinder, a framework for developing foundation models for VRPs. Our key idea is that a foundation model for VRPs should be able to model variants by treating each variant as a subset of a larger VRP problem, equipped with different attributes. We introduce a parallelized environment that can handle any combination of attributes at the same time in a batched manner, and an efficient sampling procedure to train on a mix of problems at each optimization step that can greatly improve convergence robustness. We also introduce novel Global Feature Embeddings that project instance-wise attributes efficiently onto the latent space and help the model understand different VRP variants. Finally, we introduce Efficient Adapter Layers, a simple yet effective technique to finetune pre-trained RouteFinder models to solve novel variants with previously unseen attributes outside of the original feature space. We validate our approach through extensive experiments on 24 VRP variants, demonstrating competitive results over recent multi-task learning models. We make our code openly available at https://github.com/ai4co/routefinder. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.11933 [pdf, other]

Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset

Authors: Fengxiang Wang, Hongzhen Wang, Di Wang, Zonghao Guo, Zhenyu Zhong, Long Lan, Jing Zhang, Zhiyuan Liu, Maosong Sun

Abstract: Masked Image Modeling (MIM) has emerged as a pivotal approach for developing foundational visual models in the field of remote sensing (RS). However, current RS datasets are limited in volume and diversity, which significantly constrains the capacity of MIM methods to learn generalizable representations. In this study, we introduce \textbf{RS-4M}, a large-scale dataset designed to enable highly ef… ▽ More Masked Image Modeling (MIM) has emerged as a pivotal approach for developing foundational visual models in the field of remote sensing (RS). However, current RS datasets are limited in volume and diversity, which significantly constrains the capacity of MIM methods to learn generalizable representations. In this study, we introduce \textbf{RS-4M}, a large-scale dataset designed to enable highly efficient MIM training on RS images. RS-4M comprises 4 million optical images encompassing abundant and fine-grained RS visual tasks, including object-level detection and pixel-level segmentation. Compared to natural images, RS images often contain massive redundant background pixels, which limits the training efficiency of the conventional MIM models. To address this, we propose an efficient MIM method, termed \textbf{SelectiveMAE}, which dynamically encodes and reconstructs a subset of patch tokens selected based on their semantic richness. SelectiveMAE roots in a progressive semantic token selection module, which evolves from reconstructing semantically analogical tokens to encoding complementary semantic dependencies. This approach transforms conventional MIM training into a progressive feature learning process, enabling SelectiveMAE to efficiently learn robust representations of RS images. Extensive experiments show that SelectiveMAE significantly boosts training efficiency by 2.2-2.7 times and enhances the classification, detection, and segmentation performance of the baseline MIM model.The dataset, source code, and trained models will be released. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2405.15056 [pdf, other]

ElastoGen: 4D Generative Elastodynamics

Authors: Yutao Feng, Yintong Shang, Xiang Feng, Lei Lan, Shandian Zhe, Tianjia Shao, Hongzhi Wu, Kun Zhou, Hao Su, Chenfanfu Jiang, Yin Yang

Abstract: We present ElastoGen, a knowledge-driven model that generates physically accurate and coherent 4D elastodynamics. Instead of relying on petabyte-scale data-driven learning, ElastoGen leverages the principles of physics-in-the-loop and learns from established physical knowledge, such as partial differential equations and their numerical solutions. The core idea of ElastoGen is converting the global… ▽ More We present ElastoGen, a knowledge-driven model that generates physically accurate and coherent 4D elastodynamics. Instead of relying on petabyte-scale data-driven learning, ElastoGen leverages the principles of physics-in-the-loop and learns from established physical knowledge, such as partial differential equations and their numerical solutions. The core idea of ElastoGen is converting the global differential operator, corresponding to the nonlinear elastodynamic equations, into iterative local convolution-like operations, which naturally fit modern neural networks. Each network module is specifically designed to support this goal rather than functioning as a black box. As a result, ElastoGen is exceptionally lightweight in terms of both training requirements and network scale. Additionally, due to its alignment with physical procedures, ElastoGen efficiently generates accurate dynamics for a wide range of hyperelastic materials and can be easily integrated with upstream and downstream deep modules to enable end-to-end 4D generation. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.12484 [pdf, other]

Meta-Homogenization for Knitwear Simulation

Authors: Chun Yuan, Kui Wu, Haoyang Shi, Lei Lan, Yuxing Qiu, Cem Yuksel, Huamin Wang, Chenfanfu Jiang, Yin Yang

Abstract: This paper presents meta-homogenization, a spatially varying homogenization scheme for knitwear simulation. We are motivated by the observation that macro-scale fabric dynamics is strongly correlated with its underlying knitting patterns. Therefore, homogenization towards a single material is less effective when the knitting is complex and non-repetitive. Our method tackles this challenge by homog… ▽ More This paper presents meta-homogenization, a spatially varying homogenization scheme for knitwear simulation. We are motivated by the observation that macro-scale fabric dynamics is strongly correlated with its underlying knitting patterns. Therefore, homogenization towards a single material is less effective when the knitting is complex and non-repetitive. Our method tackles this challenge by homogenizing the yarn-level material locally at volumetric elements. Assigning a virtual volume of a knitting structure enables us to model bending and twisting effects via a simple volume-preserving penalty and thus effectively alleviates the material nonlinearity. We employ an adjoint Gauss-Newton formulation to battle the dimensionality challenge of such per-element material optimization. This intuitive material model makes the forward simulation GPU-friendly. To this end, our pipeline also equips a novel domain-decomposed subspace solver crafted for GPU projective dynamics, which makes our simulator hundreds of times faster than the yarn-level simulator. Experiments validate the capability and effectiveness of meta-homogenization. Our method produces realistic animations of knitwear matching the quality of full-scale yarn-level simulations. It is also orders of magnitude faster than existing homogenization techniques in both the training and simulation stages. △ Less

Submitted 23 May, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.11694 [pdf, other]

PBI: Position-Based Dynamics Handles Updated Lagrangian Inelasticity

Authors: Chang Yu, Xuan Li, Lei Lan, Yin Yang, Chenfanfu Jiang

Abstract: Position-based Dynamics (PBD) and its extension, eXtended Position-based Dynamics (XPBD), have been predominantly applied to compliant constrained dynamics, with their potential in finite strain inelasticity remaining underexplored. XPBD stands in contrast to other meshless methods, such as the Material Point Method (MPM). MPM is based on discretizing the weak form of governing partial differentia… ▽ More Position-based Dynamics (PBD) and its extension, eXtended Position-based Dynamics (XPBD), have been predominantly applied to compliant constrained dynamics, with their potential in finite strain inelasticity remaining underexplored. XPBD stands in contrast to other meshless methods, such as the Material Point Method (MPM). MPM is based on discretizing the weak form of governing partial differential equations within a continuum domain, coupled with a hybrid Lagrangian-Eulerian method for tracking deformation gradients. In contrast, XPBD generally entails applying specific constraints, whether hard or compliant, to collections of point masses. This paper revisits this perception, investigating the potential of XPBD in handling inelastic materials that are described with continuum mechanics based yield surfaces and elastoplastic flow rules. Our inspiration is that a robust estimation of the velocity gradient is key to effectively tracking deformation gradients in any meshless context. By further incorporating implicit inelastic constitutive relationships, we introduce an updated Lagrangian augmentation to XPBD. This enhancement enables the simulation of elastoplastic, viscoplastic, and granular substances following their standard constitutive laws. We demonstrate the effectiveness of our method through high-resolution and real-time simulations of diverse materials such as snow, sand, and plasticine, and its integration with standard XPBD simulations of cloth and water. △ Less

Submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.07186 [pdf, other]

Adaptive-TMLE for the Average Treatment Effect based on Randomized Controlled Trial Augmented with Real-World Data

Authors: Mark van der Laan, Sky Qiu, Lars van der Laan

Abstract: We consider the problem of estimating the average treatment effect (ATE) when both randomized control trial (RCT) data and real-world data (RWD) are available. We decompose the ATE estimand as the difference between a pooled-ATE estimand that integrates RCT and RWD and a bias estimand that captures the conditional effect of RCT enrollment on the outcome. We introduce an adaptive targeted minimum l… ▽ More We consider the problem of estimating the average treatment effect (ATE) when both randomized control trial (RCT) data and real-world data (RWD) are available. We decompose the ATE estimand as the difference between a pooled-ATE estimand that integrates RCT and RWD and a bias estimand that captures the conditional effect of RCT enrollment on the outcome. We introduce an adaptive targeted minimum loss-based estimation (A-TMLE) framework to estimate them. We prove that the A-TMLE estimator is root-n-consistent and asymptotically normal. Moreover, in finite sample, it achieves the super-efficiency one would obtain had one known the oracle model for the conditional effect of the RCT enrollment on the outcome. Consequently, the smaller the working model of the bias induced by the RWD is, the greater our estimator's efficiency, while our estimator will always be at least as efficient as an efficient estimator that uses the RCT data only. A-TMLE outperforms existing methods in simulations by having smaller mean-squared-error and 95% confidence intervals. A-TMLE could help utilize RWD to improve the efficiency of randomized trial results without biasing the estimates of intervention effects. This approach could allow for smaller, faster trials, decreasing the time until patients can receive effective treatments. △ Less

Submitted 12 May, 2024; originally announced May 2024.

arXiv:2404.10343 [pdf, other]

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang , et al. (109 additional authors not shown)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such… ▽ More This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (ie runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/. △ Less

Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

arXiv:2403.19272 [pdf, other]

Mil2: Efficient Cloth Simulation Using Non-distance Barriers and Subspace Reuse

Authors: Lei Lan, Zixuan Lu, Jingyi Long, Chun Yuan, Xuan Li, Xiaowei He, Huamin Wang, Chenfanfu Jiang, Yin Yang

Abstract: Mil2 pushes the performance of high-resolution cloth simulation, making the simulation interactive (in milliseconds) for models with one million degrees of freedom (DOFs) while keeping every triangle untangled. The guarantee of being penetration-free is inspired by the interior-point method, which converts the inequality constraints to barrier potentials. Nevertheless, we propose a major overhaul… ▽ More Mil2 pushes the performance of high-resolution cloth simulation, making the simulation interactive (in milliseconds) for models with one million degrees of freedom (DOFs) while keeping every triangle untangled. The guarantee of being penetration-free is inspired by the interior-point method, which converts the inequality constraints to barrier potentials. Nevertheless, we propose a major overhaul of this modality by defining a novel and simple barrier formulation which does not depend on the distance between mesh primitives. Such a non-distance barrier model allows a new way to integrate collision detection into the simulation pipeline. Another contributor to the performance boost comes from the so-called subspace reuse strategy. This is based on the observation that low-frequency strain vibrations are near orthogonal to the deformation induced by collisions or self-collisions, often of high frequency. Subspace reuse then takes care of low-frequency residuals, while high-frequency residuals can also be effectively smoothed by GPU-based iterative solvers. We show that our method outperforms existing fast cloth simulators by nearly one order while keeping the entire simulation penetration-free and producing high-equality animations of high-resolution models. △ Less

Submitted 23 May, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.13795 [pdf, other]

doi 10.1287/ijoc.2023.0055

PyVRP: a high-performance VRP solver package

Authors: Niels A. Wouda, Leon Lan, Wouter Kool

Abstract: We introduce PyVRP, a Python package that implements hybrid genetic search in a state-of-the-art vehicle routing problem (VRP) solver. The package is designed for the VRP with time windows (VRPTW), but can be easily extended to support other VRP variants. PyVRP combines the flexibility of Python with the performance of C++, by implementing (only) performance critical parts of the algorithm in C++,… ▽ More We introduce PyVRP, a Python package that implements hybrid genetic search in a state-of-the-art vehicle routing problem (VRP) solver. The package is designed for the VRP with time windows (VRPTW), but can be easily extended to support other VRP variants. PyVRP combines the flexibility of Python with the performance of C++, by implementing (only) performance critical parts of the algorithm in C++, while being fully customisable at the Python level. PyVRP is a polished implementation of the algorithm that ranked 1st in the 2021 DIMACS VRPTW challenge and, after improvements, ranked 1st on the static variant of the EURO meets NeurIPS 2022 vehicle routing competition. The code follows good software engineering practices, and is well-documented and unit tested. PyVRP is freely available under the liberal MIT license. Through numerical experiments we show that PyVRP achieves state-of-the-art results on the VRPTW and capacitated VRP. We hope that PyVRP enables researchers and practitioners to easily and quickly build on a state-of-the-art VRP solver. △ Less

Submitted 21 March, 2024; v1 submitted 22 November, 2023; originally announced March 2024.

Comments: Pre-print of accepted paper in INFORMS Journal on Computing. 24 pages, 1 figure, 2 listings

arXiv:2403.13241 [pdf, other]

Tackling Noisy Labels with Network Parameter Additive Decomposition

Authors: Jingyi Wang, Xiaobo Xia, Long Lan, Xinghao Wu, Jun Yu, Wenjing Yang, Bo Han, Tongliang Liu

Abstract: Given data with noisy labels, over-parameterized deep networks suffer overfitting mislabeled data, resulting in poor generalization. The memorization effect of deep networks shows that although the networks have the ability to memorize all noisy data, they would first memorize clean training data, and then gradually memorize mislabeled training data. A simple and effective method that exploits the… ▽ More Given data with noisy labels, over-parameterized deep networks suffer overfitting mislabeled data, resulting in poor generalization. The memorization effect of deep networks shows that although the networks have the ability to memorize all noisy data, they would first memorize clean training data, and then gradually memorize mislabeled training data. A simple and effective method that exploits the memorization effect to combat noisy labels is early stopping. However, early stopping cannot distinguish the memorization of clean data and mislabeled data, resulting in the network still inevitably overfitting mislabeled data in the early training stage.In this paper, to decouple the memorization of clean data and mislabeled data, and further reduce the side effect of mislabeled data, we perform additive decomposition on network parameters. Namely, all parameters are additively decomposed into two groups, i.e., parameters $\mathbf{w}$ are decomposed as $\mathbf{w}=\bmσ+\bmγ$. Afterward, the parameters $\bmσ$ are considered to memorize clean data, while the parameters $\bmγ$ are considered to memorize mislabeled data. Benefiting from the memorization effect, the updates of the parameters $\bmσ$ are encouraged to fully memorize clean data in early training, and then discouraged with the increase of training epochs to reduce interference of mislabeled data. The updates of the parameters $\bmγ$ are the opposite. In testing, only the parameters $\bmσ$ are employed to enhance generalization. Extensive experiments on both simulated and real-world benchmarks confirm the superior performance of our method. △ Less

Submitted 28 April, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: Accepted by IEEE T-PAMI

arXiv:2403.08635 [pdf, other]

Human Alignment of Large Language Models through Online Preference Optimisation

Authors: Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

Abstract: Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contributio… ▽ More Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task. △ Less

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.08295 [pdf, other]

Gemma: Open Models Based on Gemini Research and Technology

Authors: Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari , et al. (83 additional authors not shown)

Abstract: This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Ge… ▽ More This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations. △ Less

Submitted 16 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.08271 [pdf, other]

Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification

Authors: Long Lan, Fengxiang Wang, Shuyan Li, Xiangtao Zheng, Zengmao Wang, Xinwang Liu

Abstract: Fine-grained ship classification in remote sensing (RS-FGSC) poses a significant challenge due to the high similarity between classes and the limited availability of labeled data, limiting the effectiveness of traditional supervised classification methods. Recent advancements in large pre-trained Vision-Language Models (VLMs) have demonstrated impressive capabilities in few-shot or zero-shot learn… ▽ More Fine-grained ship classification in remote sensing (RS-FGSC) poses a significant challenge due to the high similarity between classes and the limited availability of labeled data, limiting the effectiveness of traditional supervised classification methods. Recent advancements in large pre-trained Vision-Language Models (VLMs) have demonstrated impressive capabilities in few-shot or zero-shot learning, particularly in understanding image content. This study delves into harnessing the potential of VLMs to enhance classification accuracy for unseen ship categories, which holds considerable significance in scenarios with restricted data due to cost or privacy constraints. Directly fine-tuning VLMs for RS-FGSC often encounters the challenge of overfitting the seen classes, resulting in suboptimal generalization to unseen classes, which highlights the difficulty in differentiating complex backgrounds and capturing distinct ship features. To address these issues, we introduce a novel prompt tuning technique that employs a hierarchical, multi-granularity prompt design. Our approach integrates remote sensing ship priors through bias terms, learned from a small trainable network. This strategy enhances the model's generalization capabilities while improving its ability to discern intricate backgrounds and learn discriminative ship features. Furthermore, we contribute to the field by introducing a comprehensive dataset, FGSCM-52, significantly expanding existing datasets with more extensive data and detailed annotations for less common ship classes. Extensive experimental evaluations demonstrate the superiority of our proposed method over current state-of-the-art techniques. The source code will be made publicly available. △ Less

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.05530 [pdf, other]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love , et al. (1092 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February… ▽ More In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content. △ Less

Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.07307 [pdf, other]

Self-Consistent Conformal Prediction

Authors: Lars van der Laan, Ahmed M. Alaa

Abstract: In decision-making guided by machine learning, decision-makers may take identical actions in contexts with identical predicted outcomes. Conformal prediction helps decision-makers quantify uncertainty in point predictions of outcomes, allowing for better risk management for actions. Motivated by this perspective, we introduce \textit{Self-Consistent Conformal Prediction} for regression, which comb… ▽ More In decision-making guided by machine learning, decision-makers may take identical actions in contexts with identical predicted outcomes. Conformal prediction helps decision-makers quantify uncertainty in point predictions of outcomes, allowing for better risk management for actions. Motivated by this perspective, we introduce \textit{Self-Consistent Conformal Prediction} for regression, which combines two post-hoc approaches -- Venn-Abers calibration and conformal prediction -- to provide calibrated point predictions and compatible prediction intervals that are valid conditional on model predictions. Our procedure can be applied post-hoc to any black-box model to provide predictions and inferences with finite-sample prediction-conditional guarantees. Numerical experiments show our approach strikes a balance between interval efficiency and conditional validity. △ Less

Submitted 22 April, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

arXiv:2402.01972 [pdf, other]

Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts

Authors: Lars van der Laan, Marco Carone, Alex Luedtke

Abstract: We introduce efficient plug-in (EP) learning, a novel framework for the estimation of heterogeneous causal contrasts, such as the conditional average treatment effect and conditional relative risk. The EP-learning framework enjoys the same oracle-efficiency as Neyman-orthogonal learning strategies, such as DR-learning and R-learning, while addressing some of their primary drawbacks, including that… ▽ More We introduce efficient plug-in (EP) learning, a novel framework for the estimation of heterogeneous causal contrasts, such as the conditional average treatment effect and conditional relative risk. The EP-learning framework enjoys the same oracle-efficiency as Neyman-orthogonal learning strategies, such as DR-learning and R-learning, while addressing some of their primary drawbacks, including that (i) their practical applicability can be hindered by loss function non-convexity; and (ii) they may suffer from poor performance and instability due to inverse probability weighting and pseudo-outcomes that violate bounds. To avoid these drawbacks, EP-learner constructs an efficient plug-in estimator of the population risk function for the causal contrast, thereby inheriting the stability and robustness properties of plug-in estimation strategies like T-learning. Under reasonable conditions, EP-learners based on empirical risk minimization are oracle-efficient, exhibiting asymptotic equivalence to the minimizer of an oracle-efficient one-step debiased estimator of the population risk function. In simulation experiments, we illustrate that EP-learners of the conditional average treatment effect and conditional relative risk outperform state-of-the-art competitors, including T-learner, R-learner, and DR-learner. Open-source implementations of the proposed methods are available in our R package hte3. △ Less

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2402.01057 [pdf, other]

Expert Proximity as Surrogate Rewards for Single Demonstration Imitation Learning

Authors: Chia-Cheng Chiang, Li-Cheng Lan, Wei-Fang Sun, Chien Feng, Cho-Jui Hsieh, Chun-Yi Lee

Abstract: In this paper, we focus on single-demonstration imitation learning (IL), a practical approach for real-world applications where acquiring multiple expert demonstrations is costly or infeasible and the ground truth reward function is not available. In contrast to typical IL settings with multiple demonstrations, single-demonstration IL involves an agent having access to only one expert trajectory.… ▽ More In this paper, we focus on single-demonstration imitation learning (IL), a practical approach for real-world applications where acquiring multiple expert demonstrations is costly or infeasible and the ground truth reward function is not available. In contrast to typical IL settings with multiple demonstrations, single-demonstration IL involves an agent having access to only one expert trajectory. We highlight the issue of sparse reward signals in this setting and propose to mitigate this issue through our proposed Transition Discriminator-based IL (TDIL) method. TDIL is an IRL method designed to address reward sparsity by introducing a denser surrogate reward function that considers environmental dynamics. This surrogate reward function encourages the agent to navigate towards states that are proximal to expert states. In practice, TDIL trains a transition discriminator to differentiate between valid and non-valid transitions in a given environment to compute the surrogate rewards. The experiments demonstrate that TDIL outperforms existing IL approaches and achieves expert-level performance in the single-demonstration IL setting across five widely adopted MuJoCo benchmarks as well as the "Adroit Door" robotic environment. △ Less

Submitted 7 July, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: Published at ICML 2024. Code: https://github.com/stanl1y/tdil

arXiv:2401.04999 [pdf, ps, other]

Can fallback accretion on magnetar model power the X-ray flares simultaneously observed with gamma-rays of Gamma-ray bursts?

Authors: Wen-Yuan Yu, Hou-Jun Lü, Xing Yang, Lin Lan, Zhe Yang

Abstract: The prompt emission, X-ray plateau, and X-ray flares of Gamma-ray bursts (GRB) are thought to be from internal dissipation, and the magnetar as the central engine with propeller fallback accretion is proposed to interpret the observed phenomena of GRBs. In this paper, by systematically searching for X-ray emission observed by Swift/Xry Telescope, we find that seven robust GRBs include both X-ray f… ▽ More The prompt emission, X-ray plateau, and X-ray flares of Gamma-ray bursts (GRB) are thought to be from internal dissipation, and the magnetar as the central engine with propeller fallback accretion is proposed to interpret the observed phenomena of GRBs. In this paper, by systematically searching for X-ray emission observed by Swift/Xry Telescope, we find that seven robust GRBs include both X-ray flares and plateau emissions with measured redshift. More interestingly, the X-ray flares/bumps for those seven GRBs are simultaneously observed in the gamma-ray band. By adopting the propeller fallback accretion model to fit the observed data, it is found that the free parameters of two GRBs (140512A and 180329B) can be constrained very well, while in the other five cases, more or less, they are not all sufficiently constrained. On the other hand, this requires that the conversion efficiency of the propeller is to be two or three times higher than that of the spindown dipole radiation of the magnetar. If this is the case, it is contradictory to the expectation from the propeller model: namely, a dirtier ejecta should be less efficient in producing gamma-ray emissions. Our results hint that at least the magnetar central engine with propeller fallback accretion model cannot interpret very well both the GRB X-ray flares simultaneously observed in the gamma-ray band and the X-ray flares of GRBs with a high Lorentz factor. △ Less

Submitted 16 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: 13 pages, 5 figures, 1 table, updated references, match with published version

arXiv:2401.04577 [pdf, other]

Masked Audio Generation using a Single Non-Autoregressive Transformer

Authors: Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

Abstract: We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. T… ▽ More We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which, we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which will be then used for later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse between autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is being decoded in parallel. We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines, while being significantly faster (x7 faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive modeling, considering latency, throughput, and generation quality. Samples are available on our demo page https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT. △ Less

Submitted 5 March, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

arXiv:2401.00722 [pdf, other]

BRAU-Net++: U-Shaped Hybrid CNN-Transformer Network for Medical Image Segmentation

Authors: Libin Lan, Pengzhou Cai, Lu Jiang, Xiaojuan Liu, Yongmei Li, Yudong Zhang

Abstract: Accurate medical image segmentation is essential for clinical quantification, disease diagnosis, treatment planning and many other applications. Both convolution-based and transformer-based u-shaped architectures have made significant success in various medical image segmentation tasks. The former can efficiently learn local information of images while requiring much more image-specific inductive… ▽ More Accurate medical image segmentation is essential for clinical quantification, disease diagnosis, treatment planning and many other applications. Both convolution-based and transformer-based u-shaped architectures have made significant success in various medical image segmentation tasks. The former can efficiently learn local information of images while requiring much more image-specific inductive biases inherent to convolution operation. The latter can effectively capture long-range dependency at different feature scales using self-attention, whereas it typically encounters the challenges of quadratic compute and memory requirements with sequence length increasing. To address this problem, through integrating the merits of these two paradigms in a well-designed u-shaped architecture, we propose a hybrid yet effective CNN-Transformer network, named BRAU-Net++, for an accurate medical image segmentation task. Specifically, BRAU-Net++ uses bi-level routing attention as the core building block to design our u-shaped encoder-decoder structure, in which both encoder and decoder are hierarchically constructed, so as to learn global semantic information while reducing computational complexity. Furthermore, this network restructures skip connection by incorporating channel-spatial attention which adopts convolution operations, aiming to minimize local spatial information loss and amplify global dimension-interaction of multi-scale features. Extensive experiments on three public benchmark datasets demonstrate that our proposed approach surpasses other state-of-the-art methods including its baseline: BRAU-Net under almost all evaluation metrics. We achieve the average Dice-Similarity Coefficient (DSC) of 82.47, 90.10, and 92.94 on Synapse multi-organ segmentation, ISIC-2018 Challenge, and CVC-ClinicDB, as well as the mIoU of 84.01 and 88.17 on ISIC-2018 Challenge and CVC-ClinicDB, respectively. △ Less

Submitted 1 January, 2024; originally announced January 2024.

Comments: 12 pages, 6 figures, 9 tables code: https://github.com/Caipengzhou/BRAU-Netplusplus

arXiv:2312.15136 [pdf, other]

Towards End-to-End Structure Solutions from Information-Compromised Diffraction Data via Generative Deep Learning

Authors: Gabe Guo, Judah Goldfeder, Ling Lan, Aniv Ray, Albert Hanming Yang, Boyuan Chen, Simon JL Billinge, Hod Lipson

Abstract: The revolution in materials in the past century was built on a knowledge of the atomic arrangements and the structure-property relationship. The sine qua non for obtaining quantitative structural information is single crystal crystallography. However, increasingly we need to solve structures in cases where the information content in our input signal is significantly degraded, for example, due to o… ▽ More The revolution in materials in the past century was built on a knowledge of the atomic arrangements and the structure-property relationship. The sine qua non for obtaining quantitative structural information is single crystal crystallography. However, increasingly we need to solve structures in cases where the information content in our input signal is significantly degraded, for example, due to orientational averaging of grains, finite size effects due to nanostructure, and mixed signals due to sample heterogeneity. Understanding the structure property relationships in such situations is, if anything, more important and insightful, yet we do not have robust approaches for accomplishing it. In principle, machine learning (ML) and deep learning (DL) are promising approaches since they augment information in the degraded input signal with prior knowledge learned from large databases of already known structures. Here we present a novel ML approach, a variational query-based multi-branch deep neural network that has the promise to be a robust but general tool to address this problem end-to-end. We demonstrate the approach on computed powder x-ray diffraction (PXRD), along with partial chemical composition information, as input. We choose as a structural representation a modified electron density we call the Cartesian mapped electron density (CMED), that straightforwardly allows our ML model to learn material structures across different chemistries, symmetries and crystal systems. When evaluated on theoretically simulated data for the cubic and trigonal crystal systems, the system achieves up to $93.4\%$ average similarity with the ground truth on unseen materials, both with known and partially-known chemical composition information, showing great promise for successful structure solution even from degraded and incomplete input data. △ Less

Submitted 22 December, 2023; originally announced December 2023.

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee , et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr… ▽ More This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI. △ Less

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.07919 [pdf, ps, other]

Double Neutron Star Mergers: Are Late-time Radio Signals Overestimated?

Authors: Shao-Ze Li, Yun-Wei Yu, He Gao, Lin Lan

Abstract: The coalescence of binary neutron stars can yield the expulsion of a fast-moving, quasi-isotropic material, which may induce thermal radiation and give rise to kilonova emission. Moreover, the interaction between the ejected material and the surrounding environment generates an external shock, which can result in a long-lasting radio signal that persists for several decades following the merger. I… ▽ More The coalescence of binary neutron stars can yield the expulsion of a fast-moving, quasi-isotropic material, which may induce thermal radiation and give rise to kilonova emission. Moreover, the interaction between the ejected material and the surrounding environment generates an external shock, which can result in a long-lasting radio signal that persists for several decades following the merger. In contrast to supernova ejecta, kilonova ejecta exhibits a relatively lesser mass and higher velocity, and its expansion may ultimately result in the ejecta density becoming so low that the medium particle can freely pass through the ejecta. Thereby it would lead to a kind of incomplete sweeping on the interstellar medium. Employing a toy model, our investigation reveals that such incomplete sweeping may considerably diminish the late-time radio radiation power, irrespective of whether the binary neutron star merger results in the formation of a black hole or a neutron star. Our findings, thus, imply that the previously reported radio upper limits for certain short gamma-ray bursts may not necessarily place stringent constraints on the presence of a long-lived magnetar remnant in these short GRBs. △ Less

Submitted 28 February, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: 8 pages, 7 figures, accepted for publication in ApJ

arXiv:2312.02715 [pdf, other]

A queueing-based approach for integrated routing and appointment scheduling

Authors: René Bekker, Bharti Bharti, Leon Lan, Michel Mandjes

Abstract: This paper aims to address the integrated routing and appointment scheduling (RAS) problem for a single service provider. The RAS problem is an operational challenge faced by operators that provide services requiring home attendance, such as grocery delivery, home healthcare, or maintenance services. While considering the inherently random nature of service and travel times, the goal is to minimiz… ▽ More This paper aims to address the integrated routing and appointment scheduling (RAS) problem for a single service provider. The RAS problem is an operational challenge faced by operators that provide services requiring home attendance, such as grocery delivery, home healthcare, or maintenance services. While considering the inherently random nature of service and travel times, the goal is to minimize a weighted sum of the operator's travel times and idle time, and the client's waiting times. To handle the complex search space of routing and appointment scheduling decisions, we propose a queueing-based approach to effectively deal with the appointment scheduling decisions. We use two well-known approximations from queueing theory: first, we use an approach based on phase-type distributions to accurately approximate the objective function, and second, we use an heavy-traffic approximation to derive an efficient procedure to obtain good appointment schedules. Combining these two approaches results in a fast and sufficiently accurate hybrid approximation, thus essentially reducing RAS to a routing problem. Moreover, we propose the use a simple yet effective large neighborhood search metaheuristic to explore the space of routing decisions. The effectiveness of our proposed methodology is tested on benchmark instances with up to 40 clients, demonstrating an efficient and accurate methodology for integrated routing and appointment scheduling. △ Less

Submitted 5 December, 2023; originally announced December 2023.

Comments: 25 pages, 10 figures

arXiv:2311.15540 [pdf]

EAFP-Med: An Efficient Adaptive Feature Processing Module Based on Prompts for Medical Image Detection

Authors: Xiang Li, Long Lan, Husam Lahza, Shaowu Yang, Shuihua Wang, Wenjing Yang, Hengzhu Liu, Yudong Zhang

Abstract: In the face of rapid advances in medical imaging, cross-domain adaptive medical image detection is challenging due to the differences in lesion representations across various medical imaging technologies. To address this issue, we draw inspiration from large language models to propose EAFP-Med, an efficient adaptive feature processing module based on prompts for medical image detection. EAFP-Med c… ▽ More In the face of rapid advances in medical imaging, cross-domain adaptive medical image detection is challenging due to the differences in lesion representations across various medical imaging technologies. To address this issue, we draw inspiration from large language models to propose EAFP-Med, an efficient adaptive feature processing module based on prompts for medical image detection. EAFP-Med can efficiently extract lesion features of different scales from a diverse range of medical images based on prompts while being flexible and not limited by specific imaging techniques. Furthermore, it serves as a feature preprocessing module that can be connected to any model front-end to enhance the lesion features in input images. Moreover, we propose a novel adaptive disease detection model named EAFP-Med ST, which utilizes the Swin Transformer V2 - Tiny (SwinV2-T) as its backbone and connects it to EAFP-Med. We have compared our method to nine state-of-the-art methods. Experimental results demonstrate that EAFP-Med ST achieves the best performance on all three datasets (chest X-ray images, cranial magnetic resonance imaging images, and skin images). EAFP-Med can efficiently extract lesion features from various medical images based on prompts, enhancing the model's performance. This holds significant potential for improving medical image analysis and diagnosis. △ Less

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.15539 [pdf]

A Novel Human-Based Meta-Heuristic Algorithm: Dragon Boat Optimization

Authors: Xiang Li, Long Lan, Husam Lahza, Shaowu Yang, Shuihua Wang, Wenjing Yang, Hengzhu Liu, Yudong Zhang

Abstract: (Aim) Dragon Boat Racing, a popular aquatic folklore team sport, is traditionally held during the Dragon Boat Festival. Inspired by this event, we propose a novel human-based meta-heuristic algorithm called dragon boat optimization (DBO) in this paper. (Method) It models the unique behaviors of each crew member on the dragon boat during the race by introducing social psychology mechanisms (social… ▽ More (Aim) Dragon Boat Racing, a popular aquatic folklore team sport, is traditionally held during the Dragon Boat Festival. Inspired by this event, we propose a novel human-based meta-heuristic algorithm called dragon boat optimization (DBO) in this paper. (Method) It models the unique behaviors of each crew member on the dragon boat during the race by introducing social psychology mechanisms (social loafing, social incentive). Throughout this process, the focus is on the interaction and collaboration among the crew members, as well as their decision-making in different situations. During each iteration, DBO implements different state updating strategies. By modelling the crew's behavior and adjusting the state updating strategies, DBO is able to maintain high-performance efficiency. (Results) We have tested the DBO algorithm with 29 mathematical optimization problems and 2 structural design problems. (Conclusion) The experimental results demonstrate that DBO is competitive with state-of-the-art meta-heuristic algorithms as well as conventional methods. △ Less

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.15173 [pdf, other]

Stretched Non-negative Matrix Factorization

Authors: Ran Gu, Yevgeny Rakita, Ling Lan, Zach Thatcher, Gabrielle E. Kamm, Daniel O'Nolan, Brennan Mcbride, Allison Wustrow, James R. Neilson, Karena W. Chapman, Qiang Du, Simon J. L. Billinge

Abstract: An algorithm is described and tested that carries out a non negative matrix factorization (NMF) ignoring any stretching of the signal along the axis of the independent variable. This extended NMF model is called StretchedNMF. Variability in a set of signals due to this stretching is then ignored in the decomposition. This can be used, for example, to study sets of powder diffraction data collected… ▽ More An algorithm is described and tested that carries out a non negative matrix factorization (NMF) ignoring any stretching of the signal along the axis of the independent variable. This extended NMF model is called StretchedNMF. Variability in a set of signals due to this stretching is then ignored in the decomposition. This can be used, for example, to study sets of powder diffraction data collected at different temperatures where the materials are undergoing thermal expansion. It gives a more meaningful decomposition in this case where the component signals resemble signals from chemical components in the sample. The StretchedNMF model introduces a new variable, the stretching factor, to describe any expansion of the signal. To solve StretchedNMF, we discretize it and employ Block Coordinate Descent framework algorithms. The initial experimental results indicate that StretchedNMF model outperforms the conventional NMF for sets of data with such an expansion. A further enhancement to StretchedNMF for the case of powder diffraction data from crystalline materials called Sparse-StretchedNMF, which makes use of the sparsity of the powder diffraction signals, allows correct extractions even for very small stretches where StretchedNMF struggles. As well as demonstrating the model performance on simulated PXRD patterns and atomic pair distribution functions (PDFs), it also proved successful when applied to real data taken from an in situ chemical reaction experiment. △ Less

Submitted 25 November, 2023; originally announced November 2023.

Comments: 39 pages, 16 figures

arXiv:2311.09654 [pdf, other]

On the possibility to detect gravitational waves from post-merger super-massive neutron stars with a kilohertz detector

Authors: Yikang Chen, Bin Liu, Shunke Ai, Lin Lan, He Gao, Yong Yuan, Zong-Hong Zhu

Abstract: The detection of a secular post-merger gravitational wave (GW) signal in a binary neutron star (BNS) merger serves as strong evidence for the formation of a long-lived post-merger neutron star (NS), which can help constrain the maximum mass of NSs and differentiate NS equation of states. We specifically focus on the detection of GW emissions from rigidly rotating NSs formed through BNS mergers, us… ▽ More The detection of a secular post-merger gravitational wave (GW) signal in a binary neutron star (BNS) merger serves as strong evidence for the formation of a long-lived post-merger neutron star (NS), which can help constrain the maximum mass of NSs and differentiate NS equation of states. We specifically focus on the detection of GW emissions from rigidly rotating NSs formed through BNS mergers, using several kilohertz GW detectors that have been designed. We simulate the BNS mergers within the detecting limit of LIGO-Virgo-KARGA O4 and attempt to find out on what fraction the simulated sources may have a detectable secular post-merger GW signal. For kilohertz detectors designed in the same configuration of LIGO A+, we find that the design with peak sensitivity at approximately $2{\rm kHz}$ is most appropriate for such signals. The fraction of sources that have a detectable secular post-merger GW signal would be approximately $0.94\% - 11\%$ when the spindowns of the post-merger rigidly rotating NSs are dominated by GW radiation, while be approximately $0.46\% - 1.6\%$ when the contribution of electromagnetic (EM) radiation to the spin-down processes is non-negligible. We also estimate this fraction based on other well-known proposed kilohertz GW detectors and find that, with advanced design, it can reach approximately $12\% - 45\%$ for the GW-dominated spindown case and $4.7\% - 16\%$ when both the GW and EM radiations are considered. △ Less

Submitted 16 November, 2023; originally announced November 2023.

Comments: 10 pages, 6 figures, 4 tables, accepted for publication on MNRAS

arXiv:2311.00895 [pdf, other]

In-Context Prompt Editing For Conditional Audio Generation

Authors: Ernie Chang, Pin-Jie Lin, Yang Li, Sidd Srinivasan, Gael Le Lan, David Kant, Yangyang Shi, Forrest Iandola, Vikas Chandra

Abstract: Distributional shift is a central challenge in the deployment of machine learning models as they can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation where the encoded representations are easily undermined by unseen prompts, which leads to the degradation of generated audio -- the limited set of the text-audio pairs remains inadequate for conditional au… ▽ More Distributional shift is a central challenge in the deployment of machine learning models as they can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation where the encoded representations are easily undermined by unseen prompts, which leads to the degradation of generated audio -- the limited set of the text-audio pairs remains inadequate for conditional audio generation in the wild as user prompts are under-specified. In particular, we observe a consistent audio quality degradation in generated audio samples with user prompts, as opposed to training set prompts. To this end, we present a retrieval-based in-context prompt editing framework that leverages the training captions as demonstrative exemplars to revisit the user prompts. We show that the framework enhanced the audio quality across the set of collected user prompts, which were edited with reference to the training captions as exemplars. △ Less

Submitted 1 November, 2023; originally announced November 2023.

Comments: 5 pages, 3 figures, 2 tables

arXiv:2310.11093 [pdf, other]

SODA: Robust Training of Test-Time Data Adaptors

Authors: Zige Wang, Yonggang Zhang, Zhen Fang, Long Lan, Wenjing Yang, Bo Han

Abstract: Adapting models deployed to test distributions can mitigate the performance degradation caused by distribution shifts. However, privacy concerns may render model parameters inaccessible. One promising approach involves utilizing zeroth-order optimization (ZOO) to train a data adaptor to adapt the test data to fit the deployed models. Nevertheless, the data adaptor trained with ZOO typically brings… ▽ More Adapting models deployed to test distributions can mitigate the performance degradation caused by distribution shifts. However, privacy concerns may render model parameters inaccessible. One promising approach involves utilizing zeroth-order optimization (ZOO) to train a data adaptor to adapt the test data to fit the deployed models. Nevertheless, the data adaptor trained with ZOO typically brings restricted improvements due to the potential corruption of data features caused by the data adaptor. To address this issue, we revisit ZOO in the context of test-time data adaptation. We find that the issue directly stems from the unreliable estimation of the gradients used to optimize the data adaptor, which is inherently due to the unreliable nature of the pseudo-labels assigned to the test data. Based on this observation, we propose pseudo-label-robust data adaptation (SODA) to improve the performance of data adaptation. Specifically, SODA leverages high-confidence predicted labels as reliable labels to optimize the data adaptor with ZOO for label prediction. For data with low-confidence predictions, SODA encourages the adaptor to preserve data information to mitigate data corruption. Empirical results indicate that SODA can significantly enhance the performance of deployed models in the presence of distribution shifts without requiring access to model parameters. △ Less

Submitted 17 October, 2023; originally announced October 2023.

arXiv:2310.09926 [pdf, other]

Estimating Uncertainty in Multimodal Foundation Models using Public Internet Data

Authors: Shiladitya Dutta, Hongbo Wei, Lars van der Laan, Ahmed M. Alaa

Abstract: Foundation models are trained on vast amounts of data at scale using self-supervised learning, enabling adaptation to a wide range of downstream tasks. At test time, these models exhibit zero-shot capabilities through which they can classify previously unseen (user-specified) categories. In this paper, we address the problem of quantifying uncertainty in these zero-shot predictions. We propose a h… ▽ More Foundation models are trained on vast amounts of data at scale using self-supervised learning, enabling adaptation to a wide range of downstream tasks. At test time, these models exhibit zero-shot capabilities through which they can classify previously unseen (user-specified) categories. In this paper, we address the problem of quantifying uncertainty in these zero-shot predictions. We propose a heuristic approach for uncertainty estimation in zero-shot settings using conformal prediction with web data. Given a set of classes at test time, we conduct zero-shot classification with CLIP-style models using a prompt template, e.g., "an image of a <category>", and use the same template as a search query to source calibration data from the open web. Given a web-based calibration set, we apply conformal prediction with a novel conformity score that accounts for potential errors in retrieved web data. We evaluate the utility of our proposed method in Biomedical foundation models; our preliminary results show that web-based conformal prediction sets achieve the target coverage with satisfactory efficiency on a variety of biomedical datasets. △ Less

Submitted 26 November, 2023; v1 submitted 15 October, 2023; originally announced October 2023.

arXiv:2310.07133 [pdf, other]

What constraints can one pose on the maximum mass of neutron stars from multi-messenger observations?

Authors: Shunke Ai, He Gao, Yong Yuan, Bing Zhang, Lin Lan

Abstract: The maximum mass of neutron stars ($M_{\rm TOV}$) plays a crucial role in understanding their equation of state (EoS). Previous studies have used the measurements for the compactness of massive pulsars and the tidal deformability of neutron stars in binary neutron star (BNS) mergers to constrain the EoS and thus the $M_{\rm TOV}$. The discovery of the most massive pulsar, PSR J0952-0607, with a ma… ▽ More The maximum mass of neutron stars ($M_{\rm TOV}$) plays a crucial role in understanding their equation of state (EoS). Previous studies have used the measurements for the compactness of massive pulsars and the tidal deformability of neutron stars in binary neutron star (BNS) mergers to constrain the EoS and thus the $M_{\rm TOV}$. The discovery of the most massive pulsar, PSR J0952-0607, with a mass $\sim 2.35M_{\odot}$, has provided a valuable lower limit for $M_{\rm TOV}$. Another efficient method to constrain $M_{\rm TOV}$ is by examining the type of central remnant formed after a BNS merger. Gravitational wave (GW) data can provide the total mass of the system, while accompanying electromagnetic signals can help infer the remnant type. In this study, we combine all the previous constraints and utilize the observational facts that about $24\%$ of the short gamma-ray bursts are followed by an X-ray internal plateau, which indicate that roughly this fraction of BNS mergers yield supermassive neutron stars, to perform (Markov Chain) Monte Carlo simulations. These simulations allow us to explore the probability density distribution of $M_{\rm TOV}$ and other parameters related to BNS mergers. Our findings suggest that $M_{\rm TOV}$ is likely around $2.49M_{\odot} - 2.52M_{\odot}$, with an uncertainty range of approximately [$-0.16M_{\odot}$, $0.15M_{\odot}$] ([$-0.28M_{\odot}$, $0.26M_{\odot}$]) at $1σ$ ($2σ$) confidence level. Furthermore, we examine the type of merger remnants in specific events like GW170817 and GW190425 to further constrain $M_{\rm TOV}$ and other relevant parameters, which can help to understand the physical processes involved in BNS mergers. △ Less

Submitted 10 October, 2023; originally announced October 2023.

Comments: 14 pages, 10 figures, accepted for publication on MNRAS

arXiv:2310.00289 [pdf, other]

Pubic Symphysis-Fetal Head Segmentation Using Pure Transformer with Bi-level Routing Attention

Authors: Pengzhou Cai, Jiang Lu, Yanxin Li, Libin Lan

Abstract: In this paper, we propose a method, named BRAU-Net, to solve the pubic symphysis-fetal head segmentation task. The method adopts a U-Net-like pure Transformer architecture with bi-level routing attention and skip connections, which effectively learns local-global semantic information. The proposed BRAU-Net was evaluated on transperineal Ultrasound images dataset from the pubic symphysis-fetal head… ▽ More In this paper, we propose a method, named BRAU-Net, to solve the pubic symphysis-fetal head segmentation task. The method adopts a U-Net-like pure Transformer architecture with bi-level routing attention and skip connections, which effectively learns local-global semantic information. The proposed BRAU-Net was evaluated on transperineal Ultrasound images dataset from the pubic symphysis-fetal head segmentation and angle of progression (FH-PS-AOP) challenge. The results demonstrate that the proposed BRAU-Net achieves comparable a final score. The codes will be available at https://github.com/Caipengzhou/BRAU-Net. △ Less

Submitted 7 October, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

arXiv:2309.10795 [pdf, other]

Exploring Speech Enhancement for Low-resource Speech Synthesis

Authors: Zhaoheng Ni, Sravya Popuri, Ning Dong, Kohei Saijo, Xiaohui Zhang, Gael Le Lan, Yangyang Shi, Vikas Chandra, Changhan Wang

Abstract: High-quality and intelligible speech is essential to text-to-speech (TTS) model training, however, obtaining high-quality data for low-resource languages is challenging and expensive. Applying speech enhancement on Automatic Speech Recognition (ASR) corpus mitigates the issue by augmenting the training data, while how the nonlinear speech distortion brought by speech enhancement models affects TTS… ▽ More High-quality and intelligible speech is essential to text-to-speech (TTS) model training, however, obtaining high-quality data for low-resource languages is challenging and expensive. Applying speech enhancement on Automatic Speech Recognition (ASR) corpus mitigates the issue by augmenting the training data, while how the nonlinear speech distortion brought by speech enhancement models affects TTS training still needs to be investigated. In this paper, we train a TF-GridNet speech enhancement model and apply it to low-resource datasets that were collected for the ASR task, then train a discrete unit based TTS model on the enhanced speech. We use Arabic datasets as an example and show that the proposed pipeline significantly improves the low-resource TTS system compared with other baseline methods in terms of ASR WER metric. We also run empirical analysis on the correlation between speech enhancement and TTS performances. △ Less

Submitted 19 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024

arXiv:2309.10537 [pdf, other]

FoleyGen: Visually-Guided Audio Generation

Authors: Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

Abstract: Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we… ▽ More Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we introduce FoleyGen, an open-domain V2A generation system built on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural audio codec for bidirectional conversion between waveforms and discrete tokens. The generation of audio tokens is facilitated by a single Transformer model, which is conditioned on visual features extracted from a visual encoder. A prevalent problem in V2A generation is the misalignment of generated audio with the visible actions in the video. To address this, we explore three novel visual attention mechanisms. We further undertake an exhaustive evaluation of multiple visual encoders, each pretrained on either single-modal or multi-modal tasks. The experimental results on VGGSound dataset show that our proposed FoleyGen outperforms previous systems across all objective metrics and human evaluations. △ Less

Submitted 19 September, 2023; originally announced September 2023.

arXiv:2309.08804 [pdf, other]

Stack-and-Delay: a new codebook pattern for music generation

Authors: Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra

Abstract: In language modeling based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns. In particular, flattening the codebooks represents the highest quality decoding strategy, while being notoriously slow. To this end, we propose a novel stack-and-delay… ▽ More In language modeling based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns. In particular, flattening the codebooks represents the highest quality decoding strategy, while being notoriously slow. To this end, we propose a novel stack-and-delay style of decoding strategy to improve upon the flat pattern decoding where generation speed is four times faster as opposed to vanilla flat decoding. This brings the inference time close to that of the delay decoding strategy, and allows for faster inference on GPU for small batch sizes. For the same inference efficiency budget as the delay pattern, we show that the proposed approach performs better in objective evaluations, almost closing the gap with the flat pattern in terms of quality. The results are corroborated by subjective evaluations which show that samples generated by the new model are slightly more often preferred to samples generated by the competing model given the same text prompts. △ Less

Submitted 15 September, 2023; originally announced September 2023.

arXiv:2309.08773 [pdf, other]

Enhance audio generation controllability through representation similarity regularization

Authors: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra

Abstract: This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the model leverages input from both textual and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regula… ▽ More This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the model leverages input from both textual and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regularization to ensure the alignment between the chosen text representation and the language model's predictions. Our proposal involves the incorporation of audio and text representation regularization, particularly during the classifier-free guidance (CFG) phase, where the text condition is excluded from cross attention during language model training. The aim of this proposed representation regularization is to minimize discrepancies in audio and text similarity compared to other samples within the same training batch. Experimental results on both music and audio generation tasks demonstrate that our proposed methods lead to improvements in objective metrics for both audio and music generation, as well as an enhancement in the human perception for audio generation. △ Less

Submitted 15 September, 2023; originally announced September 2023.

Comments: 5 pages

arXiv:2308.14476 [pdf, other]

doi 10.1287/trsc.2023.0111

An iterative sample scenario approach for the dynamic dispatch waves problem

Authors: Leon Lan, Jasper van Doorn, Niels A. Wouda, Arpan Rijal, Sandjai Bhulai

Abstract: A challenge in same-day delivery operations is that delivery requests are typically not known beforehand, but are instead revealed dynamically during the day. This uncertainty introduces a trade-off between dispatching vehicles to serve requests as soon as they are revealed to ensure timely delivery, and delaying the dispatching decision to consolidate routing decisions with future, currently unkn… ▽ More A challenge in same-day delivery operations is that delivery requests are typically not known beforehand, but are instead revealed dynamically during the day. This uncertainty introduces a trade-off between dispatching vehicles to serve requests as soon as they are revealed to ensure timely delivery, and delaying the dispatching decision to consolidate routing decisions with future, currently unknown requests. In this paper, we study the dynamic dispatch waves problem, a same-day delivery problem in which vehicles are dispatched at fixed decision moments. At each decision moment, the system operator must decide which of the known requests to dispatch, and how to route these dispatched requests. The operator's goal is to minimize the total routing cost while ensuring that all requests are served on time. We propose iterative conditional dispatch (ICD), an iterative solution construction procedure based on a sample scenario approach. ICD iteratively solves sample scenarios to classify requests to be dispatched, postponed, or undecided. The set of undecided requests shrinks in each iteration until a final dispatching decision is made in the last iteration. We develop two variants of ICD: one variant based on thresholds, and another variant based on similarity. A significant strength of ICD is that it is conceptually simple and easy to implement. This simplicity does not harm performance: through rigorous numerical experiments, we show that both variants efficiently navigate the large state and action spaces of the dynamic dispatch waves problem and quickly converge to a high-quality solution. Finally, we demonstrate that the threshold-based ICD variant achieves excellent results on instances from the EURO meets NeurIPS 2022 vehicle routing competition, nearly matching the performance of the winning machine learning-based strategy. △ Less

Submitted 21 March, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

arXiv:2307.13981 [pdf, other]

Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models

Authors: Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, Kede Ma

Abstract: Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users' viewing experience in various real-world video-enabled media applications. As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets. Thus, it is crucial to gain a better understanding of existing VQA datasets in order to proper… ▽ More Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users' viewing experience in various real-world video-enabled media applications. As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets. Thus, it is crucial to gain a better understanding of existing VQA datasets in order to properly evaluate the current progress in BVQA. Towards this goal, we conduct a first-of-its-kind computational analysis of VQA datasets via designing minimalistic BVQA models. By minimalistic, we restrict our family of BVQA models to build only upon basic blocks: a video preprocessor (for aggressive spatiotemporal downsampling), a spatial quality analyzer, an optional temporal quality analyzer, and a quality regressor, all with the simplest possible instantiations. By comparing the quality prediction performance of different model variants on eight VQA datasets with realistic distortions, we find that nearly all datasets suffer from the easy dataset problem of varying severity, some of which even admit blind image quality assessment (BIQA) solutions. We additionally justify our claims by contrasting our model generalizability on these VQA datasets, and by ablating a dizzying set of BVQA design choices related to the basic building blocks. Our results cast doubt on the current progress in BVQA, and meanwhile shed light on good practices of constructing next-generation VQA datasets and models. △ Less

Submitted 3 April, 2024; v1 submitted 26 July, 2023; originally announced July 2023.

arXiv:2307.12544 [pdf, other]

Adaptive debiased machine learning using data-driven model selection techniques

Authors: Lars van der Laan, Marco Carone, Alex Luedtke, Mark van der Laan

Abstract: Debiased machine learning estimators for nonparametric inference of smooth functionals of the data-generating distribution can suffer from excessive variability and instability. For this reason, practitioners may resort to simpler models based on parametric or semiparametric assumptions. However, such simplifying assumptions may fail to hold, and estimates may then be biased due to model misspecif… ▽ More Debiased machine learning estimators for nonparametric inference of smooth functionals of the data-generating distribution can suffer from excessive variability and instability. For this reason, practitioners may resort to simpler models based on parametric or semiparametric assumptions. However, such simplifying assumptions may fail to hold, and estimates may then be biased due to model misspecification. To address this problem, we propose Adaptive Debiased Machine Learning (ADML), a nonparametric framework that combines data-driven model selection and debiased machine learning techniques to construct asymptotically linear, adaptive, and superefficient estimators for pathwise differentiable functionals. By learning model structure directly from data, ADML avoids the bias introduced by model misspecification and remains free from the restrictions of parametric and semiparametric models. While they may exhibit irregular behavior for the target parameter in a nonparametric statistical model, we demonstrate that ADML estimators provides regular and locally uniformly valid inference for a projection-based oracle parameter. Importantly, this oracle parameter agrees with the original target parameter for distributions within an unknown but correctly specified oracle statistical submodel that is learned from the data. This finding implies that there is no penalty, in a local asymptotic sense, for conducting data-driven model selection compared to having prior knowledge of the oracle submodel and oracle parameter. To demonstrate the practical applicability of our theory, we provide a broad class of ADML estimators for estimating the average treatment effect in adaptive partially linear regression models. △ Less

Submitted 24 July, 2023; originally announced July 2023.

Comments: 32 pages + appendix

arXiv:2307.00997 [pdf, other]

RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation

Authors: Yonglin Li, Jing Zhang, Xiao Teng, Long Lan

Abstract: The Segment Anything Model (SAM) has gained significant attention for its impressive performance in image segmentation. However, it lacks proficiency in referring video object segmentation (RVOS) due to the need for precise user-interactive prompts and a limited understanding of different modalities, such as language and vision. This paper presents the RefSAM model, which explores the potential of… ▽ More The Segment Anything Model (SAM) has gained significant attention for its impressive performance in image segmentation. However, it lacks proficiency in referring video object segmentation (RVOS) due to the need for precise user-interactive prompts and a limited understanding of different modalities, such as language and vision. This paper presents the RefSAM model, which explores the potential of SAM for RVOS by incorporating multi-view information from diverse modalities and successive frames at different timestamps in an online manner. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP that projects the text embedding of the referring expression into sparse and dense embeddings, serving as user-interactive prompts. Additionally, we have introduced the hierarchical dense attention module to fuse hierarchical visual semantic information with sparse embeddings in order to obtain fine-grained dense embeddings, and an implicit tracking module to generate a track token and provide historical information for the mask decoder. Furthermore, we employ a parameter-efficient tuning strategy to effectively align and fuse the language and vision features. Through comprehensive ablation studies, we demonstrate the practical and effective design choices of our model. Extensive experiments conducted on Ref-Youtu-VOS, Ref-DAVIS17, and three referring image segmentation datasets validate the superiority and effectiveness of our RefSAM model over existing methods. The code and models will be made publicly at \href{https://github.com/LancasterLi/RefSAM}{github.com/LancasterLi/RefSAM}. △ Less

Submitted 1 October, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

Comments: The code and models will be made publicly at https://github.com/LancasterLi/RefSAM

arXiv:2306.16885 [pdf, other]

Optimality in superselective surface binding by multivalent DNA nanostars

Authors: Christine Linne, Eva Heemskerk, Jos Zwanikken, Daniela J. Kraft, Liedewij Laan

Abstract: Weak multivalent interactions govern a large variety of biological processes like cell-cell adhesion and virus-host interactions. These systems distinguish sharply between surfaces based on receptor density, known as superselectivity. Earlier experimental and theoretical work provided insights into the control of selectivity: Weak interactions and a high number of ligands facilitate superselectivi… ▽ More Weak multivalent interactions govern a large variety of biological processes like cell-cell adhesion and virus-host interactions. These systems distinguish sharply between surfaces based on receptor density, known as superselectivity. Earlier experimental and theoretical work provided insights into the control of selectivity: Weak interactions and a high number of ligands facilitate superselectivity. Present experimental studies typically involve tens or hundreds of interactions, resulting in a high entropic contribution leading to high selectivities. However, if, and if so how, systems with few ligands, such as multi-domain proteins and virus binding to a membrane, show superselective behavior is an open question. Here, we address this question with a multivalent experimental model system based on star shaped branched DNA nanostructures (DNA nanostars) with each branch featuring a single stranded overhang that binds to complementary receptors on a target surface. Each DNA nanostar possesses a fluorophore, to directly visualize DNA nanostar surface adsorption by total internal reflection fluorescence microscopy (TIRFM). We observe that DNA nanostars can bind superselectively to surfaces and bind optimally at a valency of three. We quantitatively explain this optimum by extending the current theory with interactions between DNA nanostar binding sites (ligands). Our results add to the understanding of multivalent interactions, by identifying microscopic mechanisms that lead to optimal selectivity, and providing quantitative values for the relevant parameters. These findings inspire additional design rules which improve future work on selective targeting in directed drug delivery. △ Less

Submitted 29 June, 2023; originally announced June 2023.

Comments: 14 pages, 4 figures

arXiv:2306.10171 [pdf, other]

Bootstrapped Representations in Reinforcement Learning

Authors: Charline Le Lan, Stephen Tu, Mark Rowland, Anna Harutyunyan, Rishabh Agarwal, Marc G. Bellemare, Will Dabney

Abstract: In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated i… ▽ More In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated into the learning process and help shape the learnt state representation. Bootstrapping methods are today's method of choice to make these additional predictions. Yet, it is unclear which features these algorithms capture and how they relate to those from other auxiliary-task-based approaches. In this paper, we address this gap and provide a theoretical characterization of the state representation learnt by temporal difference learning (Sutton, 1988). Surprisingly, we find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting. We describe the efficacy of these representations for policy evaluation, and use our theoretical analysis to design new auxiliary learning rules. We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al, 1999) and Mountain Car (Moore, 1990). △ Less

Submitted 16 June, 2023; originally announced June 2023.

Comments: ICML 2023

arXiv:2306.04979 [pdf, other]

CoCo: A Coupled Contrastive Framework for Unsupervised Domain Adaptive Graph Classification

Authors: Nan Yin, Li Shen, Mengzhu Wang, Long Lan, Zeyu Ma, Chong Chen, Xian-Sheng Hua, Xiao Luo

Abstract: Although graph neural networks (GNNs) have achieved impressive achievements in graph classification, they often need abundant task-specific labels, which could be extensively costly to acquire. A credible solution is to explore additional labeled graphs to enhance unsupervised learning on the target domain. However, how to apply GNNs to domain adaptation remains unsolved owing to the insufficient… ▽ More Although graph neural networks (GNNs) have achieved impressive achievements in graph classification, they often need abundant task-specific labels, which could be extensively costly to acquire. A credible solution is to explore additional labeled graphs to enhance unsupervised learning on the target domain. However, how to apply GNNs to domain adaptation remains unsolved owing to the insufficient exploration of graph topology and the significant domain discrepancy. In this paper, we propose Coupled Contrastive Graph Representation Learning (CoCo), which extracts the topological information from coupled learning branches and reduces the domain discrepancy with coupled contrastive learning. CoCo contains a graph convolutional network branch and a hierarchical graph kernel network branch, which explore graph topology in implicit and explicit manners. Besides, we incorporate coupled branches into a holistic multi-view contrastive learning framework, which not only incorporates graph representations learned from complementary views for enhanced understanding, but also encourages the similarity between cross-domain example pairs with the same semantics for domain alignment. Extensive experiments on popular datasets show that our CoCo outperforms these competing baselines in different settings generally. △ Less

Submitted 10 June, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

arXiv:2306.02585 [pdf, other]

MotionTrack: Learning Motion Predictor for Multiple Object Tracking

Authors: Changcheng Xiao, Qiong Cao, Yujie Zhong, Long Lan, Xiang Zhang, Zhigang Luo, Dacheng Tao

Abstract: Significant progress has been achieved in multi-object tracking (MOT) through the evolution of detection and re-identification (ReID) techniques. Despite these advancements, accurately tracking objects in scenarios with homogeneous appearance and heterogeneous motion remains a challenge. This challenge arises from two main factors: the insufficient discriminability of ReID features and the predomi… ▽ More Significant progress has been achieved in multi-object tracking (MOT) through the evolution of detection and re-identification (ReID) techniques. Despite these advancements, accurately tracking objects in scenarios with homogeneous appearance and heterogeneous motion remains a challenge. This challenge arises from two main factors: the insufficient discriminability of ReID features and the predominant utilization of linear motion models in MOT. In this context, we introduce a novel motion-based tracker, MotionTrack, centered around a learnable motion predictor that relies solely on object trajectory information. This predictor comprehensively integrates two levels of granularity in motion features to enhance the modeling of temporal dynamics and facilitate precise future motion prediction for individual objects. Specifically, the proposed approach adopts a self-attention mechanism to capture token-level information and a Dynamic MLP layer to model channel-level features. MotionTrack is a simple, online tracking approach. Our experimental results demonstrate that MotionTrack yields state-of-the-art performance on datasets such as Dancetrack and SportsMOT, characterized by highly complex object motion. △ Less

Submitted 11 March, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

arXiv:2305.09090 [pdf, other]

BOSS -- Biomarker Optimal Segmentation System

Authors: Liuyi Lan, Xuanjin Cheng, Li Xing, Xuekui Zhang

Abstract: Motivation: Precision medicine is a major trend in the future of medicine. It aims to provide tailored medical treatment and prevention strategies based on an individual's unique characteristics and needs. Biomarker is the primary source of patients' unique features used in precision medicine. We often need to investigate many cutoff values of a continuous biomarker to find the optimal one and tes… ▽ More Motivation: Precision medicine is a major trend in the future of medicine. It aims to provide tailored medical treatment and prevention strategies based on an individual's unique characteristics and needs. Biomarker is the primary source of patients' unique features used in precision medicine. We often need to investigate many cutoff values of a continuous biomarker to find the optimal one and test if it can help segment patients into two groups with significantly different clinical outcomes. This requires multiple testing adjustments on tests conducted on overlapped data. The permutation-based approach is often a preferred solution, since it does not suffer the limitations of state-of-art theoretical methods. However, permutation is computationally expensive and limits its application scenarios, such as web applications requiring a fast response or the analysis of genomic study requiring to repeat analysis many times on tens of thousands of genes. Results: We proposed a novel method BOSS, Biomarker Optimal Segmentation System, to solve this problem. In simulation studies, we found BOSS's statistical power and type I error control are both non-inferior to the permutation approach, and it is hundreds of times faster than permutation. To illustrate our method, we applied BOSS to real data and revealed potentially converging biomarkers that have referential importance in exploring synergy and target-matched therapies in lung adenocarcinoma. Availability: An R package, boss, is being developed and will be available on CRAN △ Less

Submitted 15 May, 2023; originally announced May 2023.

arXiv:2305.06710 [pdf, other]

Null-text Guidance in Diffusion Models is Secretly a Cartoon-style Creator

Authors: Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wanrong Huang, Wenjing Yang

Abstract: Classifier-free guidance is an effective sampling technique in diffusion models that has been widely adopted. The main idea is to extrapolate the model in the direction of text guidance and away from null-text guidance. In this paper, we demonstrate that null-text guidance in diffusion models is secretly a cartoon-style creator, i.e., the generated images can be efficiently transformed into cartoo… ▽ More Classifier-free guidance is an effective sampling technique in diffusion models that has been widely adopted. The main idea is to extrapolate the model in the direction of text guidance and away from null-text guidance. In this paper, we demonstrate that null-text guidance in diffusion models is secretly a cartoon-style creator, i.e., the generated images can be efficiently transformed into cartoons by simply perturbing the null-text guidance. Specifically, we proposed two disturbance methods, i.e., Rollback disturbance (Back-D) and Image disturbance (Image-D), to construct misalignment between the noisy images used for predicting null-text guidance and text guidance (subsequently referred to as \textbf{null-text noisy image} and \textbf{text noisy image} respectively) in the sampling process. Back-D achieves cartoonization by altering the noise level of null-text noisy image via replacing $x_t$ with $x_{t+Δt}$. Image-D, alternatively, produces high-fidelity, diverse cartoons by defining $x_t$ as a clean input image, which further improves the incorporation of finer image details. Through comprehensive experiments, we delved into the principle of noise disturbing for null-text and uncovered that the efficacy of disturbance depends on the correlation between the null-text noisy image and the source image. Moreover, our proposed techniques, which can generate cartoon images and cartoonize specific ones, are training-free and easily integrated as a plug-and-play component in any classifier-free guided diffusion model. Project page is available at \url{https://nulltextforcartoon.github.io/}. △ Less

Submitted 3 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: Accepted by ACM MM 2023

Journal ref: ACM MM 2023

arXiv:2304.13424 [pdf, other]

Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories

Authors: Li-Cheng Lan, Huan Zhang, Cho-Jui Hsieh

Abstract: In this paper, we define, evaluate, and improve the ``relay-generalization'' performance of reinforcement learning (RL) agents on the out-of-distribution ``controllable'' states. Ideally, an RL agent that generally masters a task should reach its goal starting from any controllable state of the environment instead of memorizing a small set of trajectories. For example, a self-driving system should… ▽ More In this paper, we define, evaluate, and improve the ``relay-generalization'' performance of reinforcement learning (RL) agents on the out-of-distribution ``controllable'' states. Ideally, an RL agent that generally masters a task should reach its goal starting from any controllable state of the environment instead of memorizing a small set of trajectories. For example, a self-driving system should be able to take over the control from humans in the middle of driving and must continue to drive the car safely. To practically evaluate this type of generalization, we start the test agent from the middle of other independently well-trained \emph{stranger} agents' trajectories. With extensive experimental evaluation, we show the prevalence of \emph{generalization failure} on controllable states from stranger agents. For example, in the Humanoid environment, we observed that a well-trained Proximal Policy Optimization (PPO) agent, with only 3.9\% failure rate during regular testing, failed on 81.6\% of the states generated by well-trained stranger PPO agents. To improve "relay generalization," we propose a novel method called Self-Trajectory Augmentation (STA), which will reset the environment to the agent's old states according to the Q function during training. After applying STA to the Soft Actor Critic's (SAC) training procedure, we reduced the failure rate of SAC under relay-evaluation by more than three times in most settings without impacting agent performance and increasing the needed number of environment interactions. Our code is available at https://github.com/lan-lc/STA. △ Less

Submitted 26 April, 2023; originally announced April 2023.

Comments: ICRL 2023

arXiv:2304.12567 [pdf, other]

Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks

Authors: Jesse Farebrother, Joshua Greaves, Rishabh Agarwal, Charline Le Lan, Ross Goroshin, Pablo Samuel Castro, Marc G. Bellemare

Abstract: Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well understood; in practice, however, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treate… ▽ More Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well understood; in practice, however, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent's network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)'s proto-value functions to deep reinforcement learning -- accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment's reward function. △ Less

Submitted 25 April, 2023; originally announced April 2023.

Comments: ICLR 2023. Code and models are available at https://github.com/google-research/google-research/tree/master/pvn 22 pages, 8 figures

Showing 1–50 of 163 results for author: Lan, L