-
Hybrid-Generative Diffusion Models for Attack-Oriented Twin Migration in Vehicular Metaverses
Authors:
Yingkai Kang,
Jinbo Wen,
Jiawen Kang,
Tao Zhang,
Hongyang Du,
Dusit Niyato,
Rong Yu,
Shengli Xie
Abstract:
The vehicular metaverse is envisioned as a blended immersive domain that promises to bring revolutionary changes to the automotive industry. As a core component of vehicular metaverses, Vehicle Twins (VTs) are digital twins that cover the entire life cycle of vehicles, providing immersive virtual services for Vehicular Metaverse Users (VMUs). Vehicles with limited resources offload the computationally intensive tasks of constructing and updating VTs to edge servers and migrate VTs between these servers, ensuring seamless and immersive experiences for VMUs. However, the high mobility of vehicles, uneven deployment of edge servers, and potential security threats pose challenges to achieving efficient and reliable VT migrations. To address these issues, we propose a secure and reliable VT migration framework in vehicular metaverses. Specifically, we design a two-layer trust evaluation model to comprehensively evaluate the reputation value of edge servers in the network communication and interaction layers. Then, we model the VT migration problem as a partially observable Markov decision process and design a hybrid-Generative Diffusion Model (GDM) algorithm based on deep reinforcement learning to generate optimal migration decisions by taking hybrid actions (i.e., continuous actions and discrete actions). Numerical results demonstrate that the hybrid-GDM algorithm outperforms the baseline algorithms, showing strong adaptability in various settings and highlighting the potential of the hybrid-GDM algorithm for addressing various optimization issues in vehicular metaverses.
Submitted 5 July, 2024;
originally announced July 2024.
-
Demonstration of Si-doped Al-rich thin regrown Al(Ga)N films on AlN on sapphire templates with $>10^{15}/cm^3$ free carrier concentration using close-coupled showerhead MOCVD reactor
Authors:
Swarnav Mukhopadhyay,
Parthasarathy Seshadri,
Mobinul Haque,
Shuwen Xie,
Ruixin Bai,
Surjava Sanyal,
Guangying Wang,
Chirag Gupta,
Shubhra S. Pasayat
Abstract:
Thin Si-doped Al-rich (Al>0.85) regrown Al(Ga)N layers were deposited on an AlN-on-sapphire template using metal-organic chemical vapor deposition (MOCVD). The optimization of the deposition conditions such as temperature, V/III ratio, deposition rate, and Si concentration resulted in a high charge carrier concentration (>$10^{15}/cm^{3}$) in the Si-doped Al-rich Al(Ga)N films. A pulsed deposition condition was employed to achieve a controllable Al composition greater than 95% and to prevent unintended Ga incorporation in the AlGaN material deposited using the close-coupled showerhead reactor. Also, the effect of unintentional Si incorporation on free charge carrier concentration at the regrowth interface was observed by varying the thickness of the regrown Al(Ga)N layer. Maximum charge carrier concentrations of $4.8\times 10^{16}/cm^3$ and $7.5\times 10^{15}/cm^3$ were achieved for Al0.97Ga0.03N and AlN films with thickness <300 nm, compared to previously reported n-Al(Ga)N films with thickness $\ge$400 nm deposited using MOCVD.
Submitted 14 July, 2024;
originally announced July 2024.
-
Results for pixel and strip centimeter-scale AC-LGAD sensors with a 120 GeV proton beam
Authors:
Irene Dutta,
Christopher Madrid,
Ryan Heller,
Shirsendu Nanda,
Danush Shekar,
Claudio San Martín,
Matías Barría,
Artur Apresyan,
Zhenyu Ye,
William K. Brooks,
Wei Chen,
Gabriele D'Amen,
Gabriele Giacomini,
Alessandro Tricoli,
Aram Hayrapetyan,
Hakseong Lee,
Ohannes Kamer Köseyan,
Sergey Los,
Koji Nakamura,
Sayuka Kita,
Tomoka Imamura,
Cristían Peña,
Si Xie
Abstract:
We present the results of an extensive evaluation of strip and pixel AC-LGAD sensors tested with a 120 GeV proton beam, focusing on the influence of design parameters on the sensor temporal and spatial resolutions. Results show that reducing the thickness of pixel sensors significantly enhances their time resolution, with 20 $μ$m-thick sensors achieving around 20 ps. Uniform performance is attainable with optimized sheet resistance, making these sensors ideal for future timing detectors. Conversely, 20 $μ$m-thick strip sensors exhibit higher jitter than similar pixel sensors, negatively impacting time resolution, despite reduced Landau fluctuations with respect to the 50 $μ$m-thick versions. Additionally, it is observed that a low resistivity in strip sensors limits signal size and time resolution, whereas higher resistivity improves performance. This study highlights the importance of tuning the n$^{+}$ sheet resistance and suggests that further improvements should target specific applications like the Electron-Ion Collider or other future collider experiments. In addition, the detailed performance of four AC-LGAD sensor designs is reported as examples of possible candidates for specific detector applications. These advancements position AC-LGADs as promising candidates for future 4D tracking systems, pending the development of specialized readout electronics.
Submitted 13 July, 2024;
originally announced July 2024.
-
A Joint Approach to Local Updating and Gradient Compression for Efficient Asynchronous Federated Learning
Authors:
Jiajun Song,
Jiajun Luo,
Rongwei Lu,
Shuzhao Xie,
Bin Chen,
Zhi Wang
Abstract:
Asynchronous Federated Learning (AFL) confronts inherent challenges arising from the heterogeneity of devices (e.g., their computation capacities) and low-bandwidth environments, both potentially causing stale model updates (e.g., local gradients) for global aggregation. Traditional approaches mitigating the staleness of updates typically focus on either adjusting the local updating or gradient compression, but not both. Recognizing this gap, we introduce a novel approach that synergizes local updating with gradient compression. Our research begins by examining the interplay between local updating frequency and gradient compression rate, and their collective impact on convergence speed. The theoretical upper bound shows that the local updating frequency and gradient compression rate of each device are jointly determined by its computing power, communication capabilities and other factors. Building on this foundation, we propose an AFL framework called FedLuck that adaptively optimizes both local update frequency and gradient compression rates. Experiments on image classification and speech recognition show that FedLuck reduces communication consumption by 56% and training time by 55% on average, achieving competitive performance in heterogeneous and low-bandwidth scenarios compared to the baselines.
Submitted 6 July, 2024;
originally announced July 2024.
-
Toward Precise Robotic Weed Flaming Using a Mobile Manipulator with a Flamethrower
Authors:
Di Wang,
Chengsong Hu,
Shuangyu Xie,
Joe Johnson,
Hojun Ji,
Yingtao Jiang,
Muthukumar Bagavathiannan,
Dezhen Song
Abstract:
Robotic weed flaming is a new and environmentally friendly approach to weed removal in the agricultural field. Using a mobile manipulator equipped with a flamethrower, we design a new system and algorithm to enable effective weed flaming, which requires robotic manipulation with a soft and deformable end effector, as the thermal coverage of the flame is affected by dynamic or unknown environmental factors such as gravity, wind, atmospheric pressure, fuel tank pressure, and pose of the nozzle. System development includes overall design, hardware integration, and software pipeline. To enable precise weed removal, the greatest challenge is to detect and predict dynamic flame coverage in real time before motion planning, which is quite different from a conventional rigid gripper in grasping or a spray gun in painting. Based on the images from two onboard infrared cameras and the pose information of the flamethrower nozzle on a mobile manipulator, we propose a new dynamic flame coverage model. The flame model uses a center-arc curve with a Gaussian cross-section model to describe the flame coverage in real time. The experiments have demonstrated the working system and shown that our model and algorithm can achieve a mean average precision (mAP) of more than 76% in the reprojected images during online prediction.
Submitted 5 July, 2024;
originally announced July 2024.
-
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Authors:
Ye Bai,
Jingping Chen,
Jitong Chen,
Wei Chen,
Zhuo Chen,
Chuang Ding,
Linhao Dong,
Qianqian Dong,
Yujiao Du,
Kepan Gao,
Lu Gao,
Yi Guo,
Minglun Han,
Ting Han,
Wenchao Hu,
Xinying Hu,
Yuxiang Hu,
Deyu Hua,
Lu Huang,
Mingkun Huang,
Youjia Huang,
Jishuo Jin,
Fanliu Kong,
Zongwei Lan,
Tianyu Li
, et al. (30 additional authors not shown)
Abstract:
A modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data-matching scenarios, and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in the LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves a 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.
Submitted 10 July, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
A new subclass of gamma-ray burst originating from compact binary merger
Authors:
Chen-Wei Wang,
Wen-Jun Tan,
Shao-Lin Xiong,
Shu-Xu Yi,
Rahim Moradi,
Bing Li,
Zhen Zhang,
Yu Wang,
Yan-Zhi Meng,
Jia-Cong Liu,
Yue Wang,
Sheng-Lun Xie,
Wang-Chen Xue,
Zheng-Hang Yu,
Peng Zhang,
Wen-Long Zhang,
Yan-Qiu Zhang,
Chao Zheng
Abstract:
Type I gamma-ray bursts (GRBs) are believed to originate from compact binary mergers, usually with a duration of less than 2 seconds for the main emission. However, recent observations of GRB 211211A and GRB 230307A indicate that some merger-origin GRBs could last much longer. Since they show strikingly similar properties (indicating a common mechanism) which are different from the classic "long"-short burst (e.g. GRB 060614), forming an interesting subclass of type I GRBs, we suggest naming them type IL GRBs. By identifying the first peak of GRB 230307A as a quasi-thermal precursor, we find that the prompt emission of type IL GRBs is composed of three episodes: (1) a precursor followed by a short quiescent (or weak emission) period, (2) a long-duration main emission, and (3) an extended emission. With this burst pattern, a good candidate, GRB 170228A, was found in the Fermi/GBM archive data, and subsequent temporal and spectral analyses indeed show that GRB 170228A falls in the same cluster as GRB 211211A and GRB 230307A in many diagnostic figures. Thus this burst pattern could be a good reference for rapidly identifying type IL GRBs and conducting low-latency follow-up observations. We estimated the occurrence rate and discussed the physical origins and implications for the three emission episodes of type IL GRBs. Our analysis suggests that the pre-merger precursor model, especially the super flare model, is more favored for type IL GRBs.
Submitted 2 July, 2024;
originally announced July 2024.
-
Embedded Prompt Tuning: Towards Enhanced Calibration of Pretrained Models for Medical Images
Authors:
Wenqiang Zu,
Shenghao Xie,
Qing Zhao,
Guoqi Li,
Lei Ma
Abstract:
Foundation models pre-trained on large-scale data have been widely shown to achieve success in various natural imaging downstream tasks. Parameter-efficient fine-tuning (PEFT) methods aim to adapt foundation models to new domains by updating only a small portion of parameters in order to reduce computational overhead. However, the effectiveness of these PEFT methods, especially in cross-domain few-shot scenarios, e.g., medical image analysis, has not been fully explored. In this work, we facilitate the study of the performance of PEFT when adapting foundation models to medical image classification tasks. Furthermore, to alleviate the limitations of mainstream prompt tuning methods in how prompts are introduced and in their approximation capabilities on Transformer architectures, we propose the Embedded Prompt Tuning (EPT) method, which embeds prompt tokens into the expanded channels. We also find that there are anomalies in the feature space distribution of foundation models during the pre-training process, and prompt tuning can help mitigate this negative impact. To explain this phenomenon, we also introduce a novel perspective for understanding prompt tuning: prompt tuning is a distribution calibrator. We support this view by analyzing the patch-wise scaling and feature separation operations contained in EPT. Our experiments show that EPT outperforms several state-of-the-art fine-tuning methods by a significant margin on few-shot medical image classification tasks, and completes the fine-tuning process within highly competitive time, indicating EPT is an effective PEFT method. The source code is available at github.com/zuwenqiang/EPT.
Submitted 2 July, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
YuLan: An Open-source Large Language Model
Authors:
Yutao Zhu,
Kun Zhou,
Kelong Mao,
Wentong Chen,
Yiding Sun,
Zhipeng Chen,
Qian Cao,
Yihan Wu,
Yushuo Chen,
Feng Wang,
Lei Zhang,
Junyi Li,
Xiaolei Wang,
Lei Wang,
Beichen Zhang,
Zican Dong,
Xiaoxue Cheng,
Yuhan Chen,
Xinyu Tang,
Yupeng Hou,
Qiangqiang Ren,
Xincheng Pang,
Shufang Xie,
Wayne Xin Zhao,
Zhicheng Dou
, et al. (13 additional authors not shown)
Abstract:
Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with 12 billion parameters. The base model of YuLan is pre-trained on approximately 1.7T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training was finished in January 2024, and the model has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and code are available at https://github.com/RUC-GSAI/YuLan-Chat.
Submitted 28 June, 2024;
originally announced June 2024.
-
Hole probabilities of random zeros on compact Riemann surfaces
Authors:
Hao Wu,
Song-Yan Xie
Abstract:
We establish a convergence speed estimate for hole probabilities of zeros of random holomorphic sections on compact Riemann surfaces.
Submitted 1 July, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
On Scaling Up 3D Gaussian Splatting Training
Authors:
Hexu Zhao,
Haoyang Weng,
Daohan Lu,
Ang Li,
Jinyang Li,
Aurojit Panda,
Saining Xie
Abstract:
3D Gaussian Splatting (3DGS) is increasingly popular for 3D reconstruction due to its superior visual quality and rendering speed. However, 3DGS training currently occurs on a single GPU, limiting its ability to handle high-resolution and large-scale 3D reconstruction tasks due to memory constraints. We introduce Grendel, a distributed system designed to partition 3DGS parameters and parallelize computation across multiple GPUs. As each Gaussian affects a small, dynamic subset of rendered pixels, Grendel employs sparse all-to-all communication to transfer the necessary Gaussians to pixel partitions and performs dynamic load balancing. Unlike existing 3DGS systems that train using one camera view image at a time, Grendel supports batched training with multiple views. We explore various optimization hyperparameter scaling strategies and find that a simple sqrt(batch size) scaling rule is highly effective. Evaluations using large-scale, high-resolution scenes show that Grendel enhances rendering quality by scaling up 3DGS parameters across multiple GPUs. On the Rubble dataset, we achieve a test PSNR of 27.28 by distributing 40.4 million Gaussians across 16 GPUs, compared to a PSNR of 26.28 using 11.2 million Gaussians on a single GPU. Grendel is an open-source project available at: https://github.com/nyu-systems/Grendel-GS
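The sqrt(batch size) scaling rule mentioned above can be illustrated with a minimal sketch. The abstract does not say which hyperparameters are scaled or how Grendel applies the rule, so this example assumes, purely for illustration, that a learning rate tuned for a reference batch size is multiplied by the square root of the batch-size ratio. The function name and parameters are hypothetical and are not taken from the Grendel code base.

```python
import math


def sqrt_scaled_learning_rate(base_lr: float, batch_size: int,
                              base_batch_size: int = 1) -> float:
    """Scale a learning rate by sqrt(batch_size / base_batch_size).

    Assumption: the learning rate is the hyperparameter being scaled;
    the abstract only states that a sqrt(batch size) rule works well.
    """
    if batch_size <= 0 or base_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * math.sqrt(batch_size / base_batch_size)


# Going from 1 view per step to 4 views per step doubles the learning rate.
lr_for_four_views = sqrt_scaled_learning_rate(0.01, batch_size=4)
```

Under this assumed rule, quadrupling the batch size doubles the step size rather than quadrupling it, as a linear scaling rule would.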
Submitted 26 June, 2024;
originally announced June 2024.
-
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Authors:
Shengbang Tong,
Ellis Brown,
Penghao Wu,
Sanghyun Woo,
Manoj Middepogu,
Sai Charitha Akula,
Jihan Yang,
Shusheng Yang,
Adithya Iyer,
Xichen Pan,
Austin Wang,
Rob Fergus,
Yann LeCun,
Saining Xie
Abstract:
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
Submitted 24 June, 2024;
originally announced June 2024.
-
Tailored topotactic chemistry unlocks heterostructures of magnetic intercalation compounds
Authors:
Samra Husremović,
Oscar Gonzalez,
Berit H. Goodge,
Lilia S. Xie,
Zhizhi Kong,
Wanlin Zhang,
Sae Hee Ryu,
Stephanie M. Ribet,
Karen C. Bustillo,
Chengyu Song,
Jim Ciston,
Takashi Taniguchi,
Kenji Watanabe,
Colin Ophus,
Chris Jozwiak,
Aaron Bostwick,
Eli Rotenberg,
D. Kwabena Bediako
Abstract:
The construction of thin film heterostructures has been a widely successful archetype for fabricating materials with emergent physical properties. This strategy is of particular importance for the design of multilayer magnetic architectures in which direct interfacial spin-spin interactions between magnetic phases in dissimilar layers lead to emergent and controllable magnetic behavior. However, crystallographic incommensurability and atomic-scale interfacial disorder can severely limit the types of materials amenable to this strategy, as well as the performance of these systems. Here, we demonstrate a method for synthesizing heterostructures comprising magnetic intercalation compounds of transition metal dichalcogenides (TMDs), through directed topotactic reaction of the TMD with a metal oxide. The mechanism of the intercalation reaction enables thermally initiated intercalation of the TMD from lithographically patterned oxide films, giving access to a new family of multi-component magnetic architectures through the combination of deterministic van der Waals assembly and directed intercalation chemistry.
Submitted 21 June, 2024;
originally announced June 2024.
-
Tuning-Free Visual Customization via View Iterative Self-Attention Control
Authors:
Xiaojie Li,
Chenghao Gu,
Shuzhao Xie,
Yunpeng Bai,
Weixiang Zhang,
Zhi Wang
Abstract:
Fine-tuning diffusion models enables a wide range of personalized generation and editing applications on diverse visual modalities. While Low-Rank Adaptation (LoRA) accelerates the fine-tuning process, it still requires multiple reference images and time-consuming training, which constrains its scalability for large-scale and real-time applications. In this paper, we propose View Iterative Self-Attention Control (VisCtrl) to tackle this challenge. Specifically, VisCtrl is a training-free method that injects the appearance and structure of a user-specified subject into another subject in the target image, unlike previous approaches that require fine-tuning the model. First, we obtain the initial noise for both the reference and target images through DDIM inversion. Then, during the denoising phase, features from the reference image are injected into the target image via the self-attention mechanism. Notably, by iteratively performing this feature injection process, we ensure that the reference image features are gradually integrated into the target image. This approach results in consistent and harmonious editing with only one reference image in a few denoising steps. Moreover, benefiting from our plug-and-play architecture design and the proposed Feature Gradual Sampling strategy for multi-view editing, our method can be easily extended to edit in complex visual domains. Extensive experiments show the efficacy of VisCtrl across a spectrum of tasks, including personalized editing of images, videos, and 3D scenes.
Submitted 10 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Finding irregular subgraphs via local adjustments
Authors:
Jie Ma,
Shengjie Xie
Abstract:
For a graph $H$, let $m(H,k)$ denote the number of vertices of degree $k$ in $H$. A conjecture of Alon and Wei states that for any $d\geq 3$, every $n$-vertex $d$-regular graph contains a spanning subgraph $H$ satisfying $|m(H,k)-\frac{n}{d+1}|\leq 2$ for every $0\leq k \leq d$. This holds easily when $d\leq 2$. An asymptotic version of this conjecture was initially established by Frieze, Gould, Karoński and Pfender, subsequently improved by Alon and Wei, and most recently enhanced by Fox, Luo and Pham, approaching its complete range. All of these approaches relied on probabilistic methods.
In this paper, we provide a novel framework to study this conjecture, based on localized deterministic techniques which we call local adjustments. We prove two main results. Firstly, we show that every $n$-vertex $d$-regular graph contains a spanning subgraph $H$ satisfying $|m(H,k)-\frac{n}{d+1}|\leq 2d^2$ for all $0\leq k \leq d$, which provides the first bound independent of the value of $n$. Secondly, we confirm the case $d=3$ of the Alon-Wei Conjecture in a strong form. Both results can be generalized to multigraphs and yield efficient algorithms for finding the desired subgraphs $H$. Furthermore, we explore a generalization of the Alon-Wei Conjecture for multigraphs and its connection to the Faudree-Lehel Conjecture concerning irregularity strength.
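The quantity $m(H,k)$ and the deviation bounded in the conjecture can be made concrete with a small sketch. This is only an illustration of the definition stated in the abstract, not the authors' local-adjustment algorithm. The function names, the edge-list representation, and the K4 example are assumptions introduced here, and the code does not check that $H$ is actually a subgraph of a given $d$-regular graph.

```python
from collections import Counter


def degree_counts(n, edges):
    """Return a Counter mapping k to m(H, k), the number of degree-k vertices.

    H is given by its vertex count n (vertices 0..n-1) and a list of
    undirected edges (u, v) with u != v.
    """
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree)


def max_deviation(n, d, edges):
    """Return the maximum over 0 <= k <= d of |m(H, k) - n / (d + 1)|."""
    counts = degree_counts(n, edges)
    target = n / (d + 1)
    return max(abs(counts.get(k, 0) - target) for k in range(d + 1))


# Hypothetical example: G = K4 is 3-regular on 4 vertices, so n/(d+1) = 1.
# H is the spanning subgraph with edges 0-1 and 1-2 (vertex 3 is isolated),
# giving degrees 1, 2, 1, 0.
deviation = max_deviation(4, 3, [(0, 1), (1, 2)])
```

In this example the deviation is 1, which is within the bound of 2 stated in the conjecture.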
Submitted 9 June, 2024;
originally announced June 2024.
-
On location of maximal gradient of torsion function over some non-symmetric planar domains
Authors:
Qinfeng Li,
Shuangquan Xie,
Hang Yang,
Ruofei Yao
Abstract:
We investigate the location of the maximal gradient of the torsion function on some non-symmetric planar domains. First, for triangles, by the reflection method, we show that the maximal gradient of the torsion function always occurs on the longest sides, lying between the foot of the altitude and the midpoint. Moreover, via nodal line analysis and the continuity method, we demonstrate that, restricted to each side, the critical point of the gradient of the torsion function is unique and nondegenerate. Second, by establishing uniform estimates for narrow domains, we prove that as a planar domain bounded by the graphs of two functions becomes increasingly narrow, the location of the maximal gradient of its torsion function tends toward the endpoint of the longest vertical line segment, with smaller curvature among them. This shows that Saint-Venant's conjecture on the location of fail points is valid for asymptotically narrow domains. Third, using the reflection method, we prove that for a non-concentric annulus, the maximal gradient of the torsion function always occurs at the point on the inner ring closest to the center of the outer ring.
Submitted 14 June, 2024; v1 submitted 7 June, 2024;
originally announced June 2024.
-
Meta-Designing Quantum Experiments with Language Models
Authors:
Sören Arlt,
Haonan Duan,
Felix Li,
Sang Michael Xie,
Yuhuai Wu,
Mario Krenn
Abstract:
Artificial Intelligence (AI) has the potential to significantly advance scientific discovery by finding solutions beyond human capabilities. However, these super-human solutions are often unintuitive and require considerable effort to uncover underlying principles, if possible at all. Here, we show how a code-generating language model trained on synthetic data can not only find solutions to specific problems but can create meta-solutions, which solve an entire class of problems in one shot and simultaneously offer insight into the underlying design principles. Specifically, for the design of new quantum physics experiments, our sequence-to-sequence transformer architecture generates interpretable Python code that describes experimental blueprints for a whole class of quantum systems. We discover general and previously unknown design rules for infinitely large classes of quantum states. The ability to automatically generate generalized patterns in readable computer code is a crucial step toward machines that help discover new scientific understanding -- one of the central aims of physics.
Submitted 4 June, 2024;
originally announced June 2024.
-
EMERGE: Integrating RAG for Improved Multimodal EHR Predictive Modeling
Authors:
Yinghao Zhu,
Changyu Ren,
Zixiang Wang,
Xiaochen Zheng,
Shiyun Xie,
Junlan Feng,
Xi Zhu,
Zhoujun Li,
Liantao Ma,
Chengwei Pan
Abstract:
The integration of multimodal Electronic Health Records (EHR) data has notably advanced clinical predictive capabilities. However, current models that utilize clinical notes and multivariate time-series EHR data often lack the necessary medical context for precise clinical tasks. Previous methods using knowledge graphs (KGs) primarily focus on structured knowledge extraction. To address this, we propose EMERGE, a Retrieval-Augmented Generation (RAG) driven framework aimed at enhancing multimodal EHR predictive modeling. Our approach extracts entities from both time-series data and clinical notes by prompting Large Language Models (LLMs) and aligns them with professional PrimeKG to ensure consistency. Beyond triplet relationships, we include entities' definitions and descriptions to provide richer semantics. The extracted knowledge is then used to generate task-relevant summaries of patients' health statuses. These summaries are fused with other modalities utilizing an adaptive multimodal fusion network with cross-attention. Extensive experiments on the MIMIC-III and MIMIC-IV datasets for in-hospital mortality and 30-day readmission tasks demonstrate the superior performance of the EMERGE framework compared to baseline models. Comprehensive ablation studies and analyses underscore the efficacy of each designed module and the framework's robustness to data sparsity. EMERGE significantly enhances the use of multimodal EHR data in healthcare, bridging the gap with nuanced medical contexts crucial for informed clinical predictions.
Submitted 27 May, 2024;
originally announced June 2024.
-
Exploring Backdoor Attacks against Large Language Model-based Decision Making
Authors:
Ruochen Jiao,
Shaoyuan Xie,
Justin Yue,
Takami Sato,
Lixu Wang,
Yixuan Wang,
Qi Alfred Chen,
Qi Zhu
Abstract:
Large Language Models (LLMs) have shown significant promise in decision-making tasks when fine-tuned on specific applications, leveraging their inherent common sense and reasoning abilities learned from vast amounts of data. However, these systems are exposed to substantial safety and security risks during the fine-tuning phase. In this work, we propose the first comprehensive framework for Backdoor Attacks against LLM-enabled Decision-making systems (BALD), systematically exploring how such attacks can be introduced during the fine-tuning phase across various channels. Specifically, we propose three attack mechanisms and corresponding backdoor optimization methods to attack different components in the LLM-based decision-making pipeline: word injection, scenario manipulation, and knowledge injection. Word injection embeds trigger words directly into the query prompt. Scenario manipulation occurs in the physical environment, where a high-level backdoor semantic scenario triggers the attack. Knowledge injection conducts backdoor attacks on retrieval augmented generation (RAG)-based LLM systems, strategically injecting word triggers into poisoned knowledge while ensuring the information remains factually accurate for stealthiness. We conduct extensive experiments with three popular LLMs (GPT-3.5, LLaMA2, PaLM2), using two datasets (HighwayEnv, nuScenes), and demonstrate the effectiveness and stealthiness of our backdoor triggers and mechanisms. Finally, we critically assess the strengths and weaknesses of our proposed approaches, highlight the inherent vulnerabilities of LLMs in decision-making tasks, and evaluate potential defenses to safeguard LLM-based decision making systems.
Submitted 27 May, 2024;
originally announced May 2024.
-
MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion
Authors:
Shuyuan Tu,
Qi Dai,
Zihao Zhang,
Sicheng Xie,
Zhi-Qi Cheng,
Chong Luo,
Xintong Han,
Zuxuan Wu,
Yu-Gang Jiang
Abstract:
Despite impressive advancements in diffusion-based video editing models in altering video attributes, there has been limited exploration into modifying motion information while preserving the original protagonist's appearance and background. In this paper, we propose MotionFollower, a lightweight score-guided diffusion model for video motion editing. To introduce conditional controls to the denoising process, MotionFollower leverages two of our proposed lightweight signal controllers, one for poses and the other for appearances, both of which consist of convolution blocks without involving heavy attention calculations. Further, we design a score guidance principle based on a two-branch architecture, including the reconstruction and editing branches, which significantly enhances the modeling capability of texture details and complicated backgrounds. Concretely, we enforce several consistency regularizers and losses during the score estimation. The resulting gradients thus inject appropriate guidance to the intermediate latents, forcing the model to preserve the original background details and protagonists' appearances without interfering with the motion modification. Experiments demonstrate the competitive motion editing ability of MotionFollower qualitatively and quantitatively. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory while delivering superior motion editing performance and exclusively supporting large camera movements and actions.
Submitted 30 May, 2024;
originally announced May 2024.
-
Safety Control of Service Robots with LLMs and Embodied Knowledge Graphs
Authors:
Yong Qi,
Gabriel Kyebambo,
Siyuan Xie,
Wei Shen,
Shenghui Wang,
Bitao Xie,
Bin He,
Zhipeng Wang,
Shuo Jiang
Abstract:
Safety limitations in service robotics across various industries have raised significant concerns about the need for robust mechanisms ensuring that robots adhere to safe practices, thereby preventing actions that might harm humans or cause property damage. Despite advances, including the integration of Knowledge Graphs (KGs) with Large Language Models (LLMs), challenges in ensuring consistent safety in autonomous robot actions persist. In this paper, we propose a novel integration of Large Language Models with Embodied Robotic Control Prompts (ERCPs) and Embodied Knowledge Graphs (EKGs) to enhance the safety framework for service robots. ERCPs are designed as predefined instructions that ensure LLMs generate safe and precise responses. These responses are subsequently validated by EKGs, which provide a comprehensive knowledge base ensuring that the actions of the robot are continuously aligned with safety protocols, thereby promoting safer operational practices in varied contexts. Our experimental setup involved diverse real-world tasks, where robots equipped with our framework demonstrated significantly higher compliance with safety standards compared to traditional methods. This integration fosters secure human-robot interactions and positions our methodology at the forefront of AI-driven safety innovations in service robotics.
Submitted 28 May, 2024;
originally announced May 2024.
-
Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving
Authors:
Shaoyuan Xie,
Lingdong Kong,
Wenwei Zhang,
Jiawei Ren,
Liang Pan,
Kai Chen,
Ziwei Liu
Abstract:
Recent advancements in bird's eye view (BEV) representations have shown remarkable promise for in-vehicle 3D perception. However, while these methods have achieved impressive results on standard benchmarks, their robustness in varied conditions remains insufficiently assessed. In this study, we present RoboBEV, an extensive benchmark suite designed to evaluate the resilience of BEV algorithms. This suite incorporates a diverse set of camera corruption types, each examined over three severity levels. Our benchmarks also consider the impact of complete sensor failures that occur when using multi-modal models. Through RoboBEV, we assess 33 state-of-the-art BEV-based perception models spanning tasks like detection, map segmentation, depth estimation, and occupancy prediction. Our analyses reveal a noticeable correlation between the model's performance on in-distribution datasets and its resilience to out-of-distribution challenges. Our experimental results also underline the efficacy of strategies like pre-training and depth-free BEV transformations in enhancing robustness against out-of-distribution data. Furthermore, we observe that leveraging extensive temporal information significantly improves the model's robustness. Based on our observations, we design an effective robustness enhancement strategy based on the CLIP model. The insights from this study pave the way for the development of future BEV models that seamlessly combine accuracy with real-world robustness.
Submitted 27 May, 2024;
originally announced May 2024.
-
Large Language Models (LLMs): Deployment, Tokenomics and Sustainability
Authors:
Haiwei Dong,
Shuang Xie
Abstract:
The rapid advancement of Large Language Models (LLMs) has significantly impacted human-computer interaction, epitomized by the release of GPT-4o, which introduced comprehensive multi-modality capabilities. In this paper, we first explored the deployment strategies, economic considerations, and sustainability challenges associated with the state-of-the-art LLMs. More specifically, we discussed the deployment debate between Retrieval-Augmented Generation (RAG) and fine-tuning, highlighting their respective advantages and limitations. After that, we quantitatively analyzed the requirement of xPUs in training and inference. Additionally, for the tokenomics of LLM services, we examined the balance between performance and cost from the quality of experience (QoE)'s perspective of end users. Lastly, we envisioned the future hybrid architecture of LLM processing and its corresponding sustainability concerns, particularly in the environmental carbon footprint impact. Through these discussions, we provided a comprehensive overview of the operational and strategic considerations essential for the responsible development and deployment of LLMs.
Submitted 27 May, 2024;
originally announced May 2024.
-
EM Distillation for One-step Diffusion Models
Authors:
Sirui Xie,
Zhisheng Xiao,
Diederik P Kingma,
Tingbo Hou,
Ying Nian Wu,
Kevin Patrick Murphy,
Tim Salimans,
Ben Poole,
Ruiqi Gao
Abstract:
While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection of our method with existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.
Submitted 27 May, 2024;
originally announced May 2024.
-
M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models
Authors:
Hongyu Wang,
Jiayu Xu,
Senwei Xie,
Ruiping Wang,
Jialin Li,
Zhaojie Xie,
Bin Zhang,
Chuyan Xiong,
Xilin Chen
Abstract:
Multilingual multimodal reasoning is a core component in achieving human-level intelligence. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely unexplored. In this work, we introduce M4U, a novel and challenging benchmark for assessing the capability of multi-discipline multilingual multimodal understanding and reasoning. M4U contains 8,931 samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare in Chinese, English, and German. Using M4U, we conduct extensive evaluations of 21 leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools. The evaluation results show that the state-of-the-art model, GPT-4o, achieves only 47.6% average accuracy on M4U. Additionally, we observe that the leading LMMs exhibit significant language preferences. Our in-depth analysis indicates that leading LMMs, including GPT-4o, suffer performance degradation when prompted with cross-lingual multimodal questions, such as images with key textual information in Chinese while the question is in German. We believe that M4U can serve as a crucial tool for systematically evaluating LMMs based on their multilingual multimodal reasoning capabilities and monitoring their development. The homepage, code, and data are publicly available.
Submitted 24 May, 2024;
originally announced May 2024.
-
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Authors:
Yuexiang Zhai,
Hao Bai,
Zipeng Lin,
Jiayi Pan,
Shengbang Tong,
Yifei Zhou,
Alane Suhr,
Saining Xie,
Yann LeCun,
Yi Ma,
Sergey Levine
Abstract:
Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.
Submitted 16 May, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
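The abstract above mentions parsing open-ended text output into an executable action. The following is a minimal sketch of what such a parsing step could look like. The `action:` tag format, the action names, and the fallback to `None` are all assumptions made for illustration and are not taken from the paper.

```python
import re

# Hypothetical action set; the paper's environments define their own.
LEGAL_ACTIONS = ("left", "right", "stay")


def parse_action(text, legal=LEGAL_ACTIONS):
    """Return the last legal action named after an 'action:' tag, or None.

    The tag format is an assumption for this sketch. Both plain text
    ("Action: left") and JSON-like text ('"action": "left"') are accepted.
    """
    matches = re.findall(r'action"?\s*:\s*"?([a-z_]+)', text.lower())
    # Prefer the last mention, since reasoning text usually precedes the answer.
    for candidate in reversed(matches):
        if candidate in legal:
            return candidate
    return None  # The caller decides how to handle unparseable output.


reply = 'Thoughts: the goal is on my left. Action: left'
chosen = parse_action(reply)
```

Returning `None` rather than guessing leaves the choice of fallback (for example, a random legal action or a penalty) to the training loop.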
-
Evaluation scheme for children-centered language interaction competence of AI-driven robots
Authors:
Siqi Xie,
Jiantao Li
Abstract:
This article explores the evaluation method for the language communication proficiency of AI-driven robots engaging in interactive communication with children. The utilization of AI-driven robots in children's everyday communication is swiftly advancing, underscoring the importance of evaluating these robots' language communication skills. Based on interviews with 11 Chinese families and thematic analysis of the comment text from shopping websites, a framework is introduced in the article to assess five key dimensions of child-robot language communication: interactivity, specificity, development, sociality, and safety. We draw on the concept of "children's agency", viewing children as active participants in shaping society and cultural life alongside adults. Therefore, this article places particular emphasis on collecting data related to children. Whether through survey interviews or direct interactive experiments, we treat children as an independent object for data collection. The study involved empirical research following the mentioned framework, which involved capturing interaction videos in natural conversation settings among children from 6 families. Analysis was performed on quantitative data obtained from video recordings, alongside questionnaires and interviews carried out by parents acting as participants or observers. We found that the presence or absence of parents during children's interactions with robots can impact the evaluation of robots' language communication abilities. Ultimately, this article proposes an enhanced comprehensive evaluation framework incorporating insights from parents and children, supported by empirical evidence and inter-rater consistency assessments, showcasing the scheme's efficacy.
Submitted 15 May, 2024;
originally announced May 2024.
-
The RoboDrive Challenge: Drive Anytime Anywhere in Any Condition
Authors:
Lingdong Kong,
Shaoyuan Xie,
Hanjiang Hu,
Yaru Niu,
Wei Tsang Ooi,
Benoit R. Cottereau,
Lai Xing Ng,
Yuexin Ma,
Wenwei Zhang,
Liang Pan,
Kai Chen,
Ziwei Liu,
Weichao Qiu,
Wei Zhang,
Xu Cao,
Hao Lu,
Ying-Cong Chen,
Caixin Kang,
Xinning Zhou,
Chengyang Ying,
Wentao Shang,
Xingxing Wei,
Yinpeng Dong,
Bo Yang,
Shengyin Jiang
, et al. (66 additional authors not shown)
Abstract:
In the realm of autonomous driving, robust perception under out-of-distribution conditions is paramount for the safe deployment of vehicles. Challenges such as adverse weather, sensor malfunctions, and environmental unpredictability can severely impact the performance of autonomous systems. The 2024 RoboDrive Challenge was crafted to propel the development of driving perception technologies that can withstand and adapt to these real-world variabilities. Focusing on four pivotal tasks -- BEV detection, map segmentation, semantic occupancy prediction, and multi-view depth estimation -- the competition laid down a gauntlet to innovate and enhance system resilience against typical and atypical disturbances. This year's challenge consisted of five distinct tracks and attracted 140 registered teams from 93 institutes across 11 countries, resulting in nearly one thousand submissions evaluated through our servers. The competition culminated in 15 top-performing solutions, which introduced a range of innovative approaches including advanced data augmentation, multi-sensor fusion, self-supervised learning for error correction, and new algorithmic strategies to enhance sensor robustness. These contributions significantly advanced the state of the art, particularly in handling sensor inconsistencies and environmental variability. Participants, through collaborative efforts, pushed the boundaries of current technologies, showcasing their potential in real-world scenarios. Extensive evaluations and analyses provided insights into the effectiveness of these solutions, highlighting key trends and successful strategies for improving the resilience of driving perception systems. This challenge has set a new benchmark in the field, providing a rich repository of techniques expected to guide future research in this field.
Submitted 29 May, 2024; v1 submitted 14 May, 2024;
originally announced May 2024.
-
Enhancing Low-Energy Neutron and Gamma Ray Detection Using Convolutional Neural Networks with EJ-276 Scintillators
Authors:
Fengzhao Shen,
Tao Li,
Jingkui He,
Shenghui Xie,
Yuehuan Wei,
Tuchen Huang,
Wei Wang
Abstract:
Organic scintillators, such as plastic scintillators, are widely used to detect fast neutrons and gamma rays. The EJ-276 scintillator offers a versatile solution for detecting fast neutrons and gamma rays simultaneously, making it ideal for mixed neutron-gamma field detection applications. This study evaluates the Pulse Shape Discrimination (PSD) capabilities of the EJ-276 scintillator paired with silicon photomultiplier (SiPM) array readouts. Integrating the 1-inch EJ-276 scintillator with SiPM arrays achieved a Figure of Merit (FOM) of 1.13 at an energy threshold of 200 keVee (electron equivalent). However, using the Charge Comparison Method (CCM) to distinguish between neutrons and gamma rays was challenging, especially at energies below 200 keVee. To improve low-energy resolution, the Convolutional Neural Network (CNN) approach was adopted. The InceptionTime and EfficientNetV2 models were developed, using one-dimensional time series and two-dimensional matrix image inputs, respectively. The transformation from one-dimensional arrays to two-dimensional images was achieved using three techniques: Gramian Angular Summation Field (GASF), Recurrence Plot (RP), and Relative Position Matrix (RPM). These methods demonstrated high accuracy at energy levels above 200 keVee. At lower energy regions, CNN methods, particularly the InceptionTime model, outperformed CCM methods. Notably, CNN methods reached accuracies of 96.79% and 98.33% in the 0-100 keVee and 100-200 keVee ranges, respectively, significantly higher than the 85.49% and 94.56% achieved by CCM, representing relative improvements of 13.22% and 3.99%. These results highlight the superior performance of CNN methods in differentiating between neutrons and gamma rays, especially in low-energy regions.
Submitted 10 May, 2024;
originally announced May 2024.
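The abstract above names the Gramian Angular Summation Field (GASF) as one way to turn a one-dimensional pulse into a two-dimensional image. As a rough illustration, the standard GASF construction rescales the series to [-1, 1], maps each sample to an angle with arccos, and fills the matrix with the cosine of pairwise angle sums. The sample values and function names below are invented for this sketch; the paper's preprocessing may differ.

```python
import math


def rescale(series):
    """Linearly rescale a 1-D sequence to the interval [-1, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]  # A constant series carries no shape.
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]


def gasf(series):
    """Return the GASF matrix G[i][j] = cos(phi_i + phi_j)."""
    scaled = rescale(series)
    # Clamp against floating-point overshoot before taking arccos.
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


# Made-up pulse samples for illustration, not data from the paper.
pulse = [0.0, 0.2, 1.0, 0.6, 0.3, 0.1]
image = gasf(pulse)
```

The result is a symmetric square matrix with one row and column per sample, which can then be treated as a single-channel image by a two-dimensional CNN.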
-
Forms in prime variables and differing degrees
Authors:
Jianya Liu,
Sizhe Xie
Abstract:
Let $F_1,\ldots,F_R$ be homogeneous polynomials with integer coefficients in $n$ variables with differing degrees. Write $\boldsymbol{F}=(F_1,\ldots,F_R)$ with $D$ being the maximal degree. Suppose that $\boldsymbol{F}$ is a nonsingular system and $n\ge D^2 4^{D+6}R^5$. We prove an asymptotic formula for the number of prime solutions to $\boldsymbol{F}(\boldsymbol{x})=\boldsymbol{0}$, whose main term is positive if (i) $\boldsymbol{F}(\boldsymbol{x})=\boldsymbol{0}$ has a nonsingular solution over the $p$-adic units $\mathbb{U}_p$ for all primes $p$, and (ii) $\boldsymbol{F}(\boldsymbol{x})=\boldsymbol{0}$ has a nonsingular solution in the open cube $(0,1)^n$. This can be viewed as a smooth local-global principle for $\boldsymbol{F}(\boldsymbol{x})=\boldsymbol{0}$ with differing degrees. It follows that, under (i) and (ii), the set of prime solutions to $\boldsymbol{F}(\boldsymbol{x})=\boldsymbol{0}$ is Zariski dense in the set of its solutions.
Submitted 10 May, 2024;
originally announced May 2024.
-
Picking watermarks from noise (PWFN): an improved robust watermarking model against intensive distortions
Authors:
Sijing Xie,
Chengxin Zhao,
Nan Sun,
Wei Li,
Hefei Ling
Abstract:
Digital watermarking is the process of embedding secret information by altering images in a way that is undetectable to the human eye. To increase the robustness of the model, many deep learning-based watermarking methods use the encoder-noise-decoder architecture, adding different types of noise in the noise layer. The decoder then extracts the watermarked information from the distorted image. However, this method can only resist weak noise attacks. To improve the robustness of the decoder against stronger noise, this paper proposes to introduce a denoise module between the noise layer and the decoder. The module aims to reduce noise and recover some of the information lost to distortion. Additionally, the paper introduces the SE module to fuse the watermarking information along both the pixel and channel dimensions, improving the encoder's efficiency. Experimental results show that our proposed method is comparable to existing models and outperforms state-of-the-art methods under different noise intensities. In addition, ablation experiments show the superiority of our proposed module.
Submitted 17 May, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
SSyncOA: Self-synchronizing Object-aligned Watermarking to Resist Cropping-paste Attacks
Authors:
Chengxin Zhao,
Hefei Ling,
Sijing Xie,
Han Fang,
Yaokun Fang,
Nan Sun
Abstract:
Modern image processing tools have made it easy for attackers to crop the region or object of interest in images and paste it into other images. The challenge this cropping-paste attack poses to the watermarking technology is that it breaks the synchronization of the image watermark, introducing multiple superimposed desynchronization distortions, such as rotation, scaling, and translation. However, current watermarking methods can only resist a single type of desynchronization and cannot be applied to protect the object's copyright under the cropping-paste attack. With the finding that the key to resisting the cropping-paste attack lies in robust features of the object to protect, this paper proposes a self-synchronizing object-aligned watermarking method, called SSyncOA. Specifically, we first constrain the watermarked region to be aligned with the protected object, and then synchronize the watermark's translation, rotation, and scaling distortions by normalizing the object invariant features, i.e., its centroid, principal orientation, and minimum bounding square, respectively. To make the watermark embedded in the protected object, we introduce the object-aligned watermarking model, which incorporates the real cropping-paste attack into the encoder-noise layer-decoder pipeline and is optimized end-to-end. Besides, we illustrate the effect of different desynchronization distortions on the watermark training, which confirms the necessity of the self-synchronization process. Extensive experiments demonstrate the superiority of our method over other SOTAs.
Submitted 6 May, 2024;
originally announced May 2024.
-
DBDH: A Dual-Branch Dual-Head Neural Network for Invisible Embedded Regions Localization
Authors:
Chengxin Zhao,
Hefei Ling,
Sijing Xie,
Nan Sun,
Zongyi Li,
Yuxuan Shi,
Jiazhong Chen
Abstract:
Embedding invisible hyperlinks or hidden codes in images to replace QR codes has become a hot topic recently. This technology requires first localizing the embedded region in the captured photos before decoding. Existing methods that train models to find the invisible embedded region struggle to obtain accurate localization results, leading to degraded decoding accuracy. This limitation is primarily because the CNN network is sensitive to low-frequency signals, while the embedded signal is typically in the high-frequency form. Based on this, this paper proposes a Dual-Branch Dual-Head (DBDH) neural network tailored for the precise localization of invisible embedded regions. Specifically, DBDH uses a low-level texture branch containing 62 high-pass filters to capture the high-frequency signals induced by embedding. A high-level context branch is used to extract discriminative features between the embedded and normal regions. DBDH employs a detection head to directly detect the four vertices of the embedding region. In addition, we introduce an extra segmentation head to segment the mask of the embedding region during training. The segmentation head provides pixel-level supervision for model learning, facilitating better learning of the embedded signals. Based on two state-of-the-art invisible offline-to-online messaging methods, we construct two datasets and augmentation strategies for training and testing localization models. Extensive experiments demonstrate the superior performance of the proposed DBDH over existing methods.
Submitted 6 May, 2024;
originally announced May 2024.
-
Out-of-distribution Detection in Medical Image Analysis: A survey
Authors:
Zesheng Hong,
Yubiao Yue,
Yubin Chen,
Lele Cong,
Huanjie Lin,
Yuanmei Luo,
Mini Han Wang,
Weidong Wang,
Jialong Xu,
Xiaoqi Yang,
Hechang Chen,
Zhenzhang Li,
Sihong Xie
Abstract:
Computer-aided diagnostics has benefited from the development of deep learning-based computer vision techniques in recent years. Traditional supervised deep learning methods assume that the test sample is drawn from the same distribution as the training data. However, out-of-distribution samples can be encountered in real-world clinical scenarios, which may cause silent failures in deep learning-based medical image analysis tasks. Recently, research has explored various out-of-distribution (OOD) detection situations and techniques to enable a trustworthy medical AI system. In this survey, we systematically review the recent advances in OOD detection in medical image analysis. We first explore several factors that may cause a distributional shift when using a deep learning-based model in clinical scenarios, and define three types of distributional shift on top of these factors. We then suggest a framework to categorize and characterize existing solutions, and review previous studies according to this methodological taxonomy. Our discussion also covers evaluation protocols and metrics, as well as open challenges and underexplored research directions.
Submitted 3 July, 2024; v1 submitted 28 April, 2024;
originally announced April 2024.
-
MoDE: CLIP Data Experts via Clustering
Authors:
Jiawei Ma,
Po-Yao Huang,
Saining Xie,
Shang-Wen Li,
Luke Zettlemoyer,
Shih-Fu Chang,
Wen-Tau Yih,
Hu Xu
Abstract:
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but at lower (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
Submitted 24 April, 2024;
originally announced April 2024.
-
Uncertainty Quantification on Graph Learning: A Survey
Authors:
Chao Chen,
Chenghua Guo,
Rui Xu,
Xiangwen Liao,
Xi Zhang,
Sihong Xie,
Hui Xiong,
Philip Yu
Abstract:
Graphical models, including Graph Neural Networks (GNNs) and Probabilistic Graphical Models (PGMs), have demonstrated their exceptional capabilities across numerous fields. These models necessitate effective uncertainty quantification to ensure reliable decision-making amid the challenges posed by model training discrepancies and unpredictable testing scenarios. This survey examines recent works that address uncertainty quantification within the model architectures, training, and inference of GNNs and PGMs. We aim to provide an overview of the current landscape of uncertainty in graphical models by organizing the recent methods into uncertainty representation and handling. By summarizing state-of-the-art methods, this survey seeks to deepen the understanding of uncertainty quantification in graphical models, thereby increasing their effectiveness and safety in critical applications.
Submitted 22 April, 2024;
originally announced April 2024.
-
Children's Overtrust and Shifting Perspectives of Generative AI
Authors:
Jaemarie Solyst,
Ellia Yang,
Shixian Xie,
Jessica Hammer,
Amy Ogan,
Motahhare Eslami
Abstract:
The capabilities of generative AI (genAI) have dramatically increased in recent times, and there are opportunities for children to leverage new features for personal and school-related endeavors. However, while the future of genAI is taking form, there remain potentially harmful limitations, such as generation of outputs with misinformation and bias. We ran a workshop study focused on ChatGPT to explore middle school girls' (N = 26) attitudes and reasoning about how genAI works. We focused on girls who are often disproportionately impacted by algorithmic bias. We found that: (1) middle school girls were initially overtrusting of genAI, (2) deliberate exposure to the limitations and mistakes of generative AI shifted this overtrust to disillusionment about genAI capabilities, though they were still optimistic for future possibilities of genAI, and (3) their ideas about school policy were nuanced. This work informs how children think about genAI like ChatGPT and its integration in learning settings.
Submitted 29 June, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
DSDRNet: Disentangling Representation and Reconstruct Network for Domain Generalization
Authors:
Juncheng Yang,
Zuchao Li,
Shuai Xie,
Wei Yu,
Shijun Li
Abstract:
Domain generalization faces challenges due to the distribution shift between training and testing sets, and the presence of unseen target domains. Common solutions include domain alignment, meta-learning, data augmentation, or ensemble learning, all of which rely on domain labels or domain adversarial techniques. In this paper, we propose a Dual-Stream Separation and Reconstruction Network, dubbed DSDRNet. It is a disentanglement-reconstruction approach that integrates features of both inter-instance and intra-instance through dual-stream fusion. The method introduces novel supervised signals by combining inter-instance semantic distance and intra-instance similarity. Incorporating Adaptive Instance Normalization (AdaIN) into a two-stage cyclic reconstruction process enhances self-disentangled reconstruction signals to facilitate model convergence. Extensive experiments on four benchmark datasets demonstrate that DSDRNet outperforms other popular methods in terms of domain generalization capabilities.
Submitted 21 April, 2024;
originally announced April 2024.
-
Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models
Authors:
Juncheng Yang,
Zuchao Li,
Shuai Xie,
Weiping Zhu,
Wei Yu,
Shijun Li
Abstract:
Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource limitations. While some methods overcome the need for training by leveraging image modality cache and retrieval, they overlook the text modality's importance and cross-modal cues for the efficient adaptation of parameters in visual-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both text and image modalities. It then leverages retrieval through visual-language bimodal information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling different modal similarities to assess their respective contributions. Additionally, it explores hard samples based on differences in cross-modal affinity and enhances model performance through adaptive adjustment of sample learning intensity. Extensive experimental results on benchmark datasets demonstrate that XMAdapter outperforms previous adapter-based methods significantly regarding accuracy, generalization, and efficiency.
Submitted 18 April, 2024;
originally announced April 2024.
-
Finding the Particularity of the Active Episode of SGR J1935+2154 during Which FRB 20200428 Occurred: Implication from Statistics of Fermi/GBM X-Ray Bursts
Authors:
Sheng-Lun Xie,
Yun-Wei Yu,
Shao-Lin Xiong,
Lin Lin,
Ping Wang,
Yi Zhao,
Yue Wang,
Wen-Long Zhang
Abstract:
Using Fermi/Gamma-ray Burst Monitor data of the X-ray bursts (XRBs) of SGR J1935+2154, we investigate the temporal clustering of the bursts and the cumulative distributions of the waiting time and fluence/flux. We find that the bursts occurring in the episode hosting FRB 20200428 have clearly shorter waiting times than those in the other episodes. The general statistical properties of the XRBs further indicate that they could belong to a self-organized critical (SOC) system (e.g., starquakes), making them very similar to earthquake phenomena. Then, according to a unified scaling law between the waiting time and energy of earthquakes and their aftershocks, we carry out an analogous analysis of the XRBs and find that the FRB episode has more dependent burst events than the other episodes. This indicates that the fast radio burst (FRB) emission could be produced by the interaction between different burst events, which could correspond to a collision between different seismic/Alfven waves or different explosion outflows. Such a situation could arise when the magnetar enters a global intensive activity period.
Submitted 8 June, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households
Authors:
Zhihao Cao,
Zidong Wang,
Siwen Xie,
Anji Liu,
Lifeng Fan
Abstract:
Despite the significant demand for assistive technology among vulnerable groups (e.g., the elderly, children, and the disabled) in daily tasks, research into advanced AI-driven assistive solutions that genuinely accommodate their diverse needs remains sparse. Traditional human-machine interaction tasks often require machines to simply help without nuanced consideration of human abilities and feelings, such as their opportunity for practice and learning, sense of self-improvement, and self-esteem. Addressing this gap, we define a pivotal and novel challenge Smart Help, which aims to provide proactive yet adaptive support to human agents with diverse disabilities and dynamic goals in various tasks and environments. To establish this challenge, we leverage AI2-THOR to build a new interactive 3D realistic household environment for the Smart Help task. We introduce an innovative opponent modeling module that provides a nuanced understanding of the main agent's capabilities and goals, in order to optimize the assisting agent's helping policy. Rigorous experiments validate the efficacy of our model components and show the superiority of our holistic approach against established baselines. Our findings illustrate the potential of AI-imbued assistive robots in improving the well-being of vulnerable groups.
Submitted 13 April, 2024;
originally announced April 2024.
-
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
Authors:
Shibo Hao,
Yi Gu,
Haotian Luo,
Tianyang Liu,
Xiyan Shao,
Xinyuan Wang,
Shuhua Xie,
Haodi Ma,
Adithya Samavedhi,
Qiyue Gao,
Zhen Wang,
Zhiting Hu
Abstract:
Generating accurate step-by-step reasoning is essential for Large Language Models (LLMs) to address complex problems and enhance robustness and interpretability. Despite the flux of research on developing advanced reasoning approaches, systematically analyzing the diverse LLMs and reasoning strategies in generating reasoning chains remains a significant challenge. The difficulties stem from the lack of two key elements: (1) an automatic method for evaluating the generated reasoning chains on different tasks, and (2) a unified formalism and implementation of the diverse reasoning approaches for systematic comparison. This paper aims to close the gap: (1) We introduce AutoRace for fully automated reasoning chain evaluation. Existing metrics rely on expensive human annotations or pre-defined LLM prompts not adaptable to different tasks. In contrast, AutoRace automatically creates detailed evaluation criteria tailored for each task, and uses GPT-4 for accurate evaluation following the criteria. (2) We develop LLM Reasoners, a library for standardized modular implementation of existing and new reasoning algorithms, under a unified formulation of the search, reward, and world model components. With the new evaluation and library, (3) we conduct an extensive study of different reasoning approaches (e.g., CoT, ToT, RAP). The analysis reveals interesting findings about different factors contributing to reasoning, including reward guidance, breadth vs. depth in search, the world model, and prompt formats.
Submitted 8 April, 2024;
originally announced April 2024.
-
Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning
Authors:
Juncheng Yang,
Zuchao Li,
Shuai Xie,
Wei Yu,
Shijun Li,
Bo Du
Abstract:
The chain-of-thought technique has been well received in multi-modal tasks. It is a step-by-step linear reasoning process that adjusts the length of the chain to improve the performance of generated prompts. However, human thought processes are predominantly non-linear, as they encompass multiple aspects simultaneously and employ dynamic adjustment and updating mechanisms. Therefore, we propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning. The proposed AGoT models the human thought process not only as a chain but also treats each step as a reasoning aggregation graph, to cope with the multiple aspects of thinking that single-step reasoning overlooks. This turns the entire reasoning process into prompt aggregation and prompt flow operations. Experiments show that our multi-modal model enhanced with AGoT soft-prompting achieves good results in several tasks such as text-image retrieval, visual question answering, and image recognition. In addition, we demonstrate that it has good domain generalization performance due to better reasoning.
Submitted 6 April, 2024;
originally announced April 2024.
-
Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization
Authors:
Shuo Xie,
Zhiyuan Li
Abstract:
Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this advantage is not theoretically well-understood. One challenge here is that though intuitively Adam with $\ell_2$ regularization optimizes the $\ell_2$ regularized loss, it is not clear if AdamW optimizes a specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameter is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is the normalized steepest descent with respect to $\ell_\infty$ norm, and a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
Submitted 5 April, 2024;
originally announced April 2024.
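The constrained problem described in the AdamW abstract above can be written out explicitly. The following is only a sketch under assumed notation that does not appear in the listing: $L$ for the loss, $x$ for the parameters, $\lambda > 0$ for the weight decay factor, and $\eta_t$ for the learning rate at step $t$. The last display illustrates the stated link to Frank-Wolfe using plain SignGD with decoupled weight decay, which the abstract treats as the unsmoothed counterpart of Adam; it is not a restatement of the paper's proof.

```latex
% Assumed notation: L = loss, x = parameters,
% \lambda > 0 = weight decay factor, \eta_t = learning rate at step t.

% Constrained problem whose KKT points are the claimed limit points of AdamW:
\min_{x} \; L(x) \quad \text{subject to} \quad \|x\|_\infty \le \frac{1}{\lambda}

% Learning rate condition from the abstract (non-increasing, divergent partial sums):
\eta_{t+1} \le \eta_t, \qquad \sum_{t=1}^{\infty} \eta_t = \infty

% SignGD with decoupled weight decay, rewritten as a Frank-Wolfe step
% with step size \eta_t \lambda over the \ell_\infty ball of radius 1/\lambda:
x_{t+1} = (1 - \eta_t \lambda)\, x_t - \eta_t \,\mathrm{sign}\!\left(\nabla L(x_t)\right)
        = (1 - \eta_t \lambda)\, x_t + \eta_t \lambda \, s_t,
\qquad
s_t = -\frac{1}{\lambda}\,\mathrm{sign}\!\left(\nabla L(x_t)\right)
    \in \arg\min_{\|s\|_\infty \le 1/\lambda} \langle \nabla L(x_t), s \rangle
```

When $\eta_t \lambda \in [0, 1]$, each update is a convex combination of the current iterate and a vertex of the ball, so an iterate that starts inside the ball stays inside it.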
-
Grid-Mapping Pseudo-Count Constraint for Offline Reinforcement Learning
Authors:
Yi Shen,
Hanyan Huang,
Shan Xie
Abstract:
Offline reinforcement learning learns from a static dataset without interacting with the environment, which ensures security and thus has good application prospects. However, directly applying naive reinforcement learning methods usually fails in an offline environment due to function approximation errors caused by out-of-distribution (OOD) actions. To solve this problem, existing algorithms mainly penalize the Q-values of OOD actions, and the quality of these constraints also matters. Imprecise constraints may lead to suboptimal solutions, while precise constraints require significant computational costs. In this paper, we propose a novel count-based method for continuous domains, called the Grid-Mapping Pseudo-Count (GPC) method, to penalize the Q-value appropriately and reduce the computational cost. The proposed method maps the state and action space to a discrete space and constrains their Q-values through the pseudo-count. It is theoretically proved that only a few conditions are needed to obtain accurate uncertainty constraints in the proposed method. Moreover, we develop a Grid-Mapping Pseudo-Count Soft Actor-Critic (GPC-SAC) algorithm using GPC under the Soft Actor-Critic (SAC) framework to demonstrate the effectiveness of GPC. The experimental results on D4RL benchmark datasets show that GPC-SAC has better performance and lower computational cost compared to other algorithms.
Submitted 3 April, 2024;
originally announced April 2024.
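The grid-mapping idea in the abstract above (map continuous state-action pairs to discrete cells, count visits per cell, and penalize rarely seen cells) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the names `grid_cell` and `GridPseudoCount`, the uniform binning, and the `beta / sqrt(count + 1)` penalty are not taken from the paper, which may define the mapping and the constraint differently.

```python
import math
from collections import Counter


def grid_cell(vec, low, high, bins):
    """Map a continuous vector to a tuple of per-dimension bin indices.

    Uniform binning over [low, high] per dimension is an assumption;
    values outside the range are clipped to the nearest edge bin.
    """
    cell = []
    for v, lo, hi in zip(vec, low, high):
        frac = (v - lo) / (hi - lo)
        idx = min(bins - 1, max(0, int(frac * bins)))
        cell.append(idx)
    return tuple(cell)


class GridPseudoCount:
    """Visit counts over a grid on the concatenated (state, action) vector."""

    def __init__(self, low, high, bins):
        self.low, self.high, self.bins = low, high, bins
        self.counts = Counter()

    def _cell(self, state, action):
        return grid_cell(list(state) + list(action), self.low, self.high, self.bins)

    def update(self, state, action):
        """Record one dataset transition."""
        self.counts[self._cell(state, action)] += 1

    def count(self, state, action):
        return self.counts[self._cell(state, action)]

    def penalty(self, state, action, beta=1.0):
        """Hypothetical uncertainty penalty: large for rarely visited cells."""
        return beta / math.sqrt(self.count(state, action) + 1)


# Usage: one state dimension and one action dimension, both in [0, 1].
gpc = GridPseudoCount(low=[0.0, 0.0], high=[1.0, 1.0], bins=4)
gpc.update([0.10], [0.90])
gpc.update([0.12], [0.95])  # falls in the same cell as the first pair
seen = gpc.penalty([0.10], [0.90])    # cell visited twice
unseen = gpc.penalty([0.90], [0.10])  # cell never visited
```

In an actor-critic setting, a penalty of this kind would be subtracted from the Q-target for the corresponding state-action pair, so that actions far from the dataset receive lower value estimates.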
-
Nonreciprocal interactions in crowd dynamics: investigating the impact of moving threats on pedestrian speed preferences
Authors:
Shaocong Xie,
Rui Ye,
Xiaolian Li,
Zhongyi Huang,
Shuchao Cao,
Wei Lv,
Hong He,
Ping Zhang,
Zhiming Fang,
Jun Zhang,
Weiguo Song
Abstract:
Nonreciprocal interaction crowd systems, such as human-human, human-vehicle, and human-robot systems, often have serious impacts on pedestrian safety and social order. A more comprehensive understanding of these systems is needed to optimize system stability and efficiency. Despite the importance of these interactions, empirical research in this area remains limited. Thus, in our study we explore this underresearched area, focusing on scenarios where nonreciprocity plays a critical role, such as mass stabbings, which pose a substantial risk to public safety. We conducted the first experiments on this system and analysed high-accuracy data obtained from these experiments. The extent of the direct threat zone is determined by the speed of the moving threat and the radius of danger occurrence. We further categorize potential threats into direct, adjacent, and rear-view zones, quantifying the level of threat for pedestrians. Our study revealed that a pedestrian's desired velocity correlated positively with potential threat intensity, increasing until near the direct threat zone. An emerging steady state is observed when escape routes are blocked by moving threats. This deviation affects the density-velocity relationship, making it distinct from the general relationship. This deviation signifies unique pedestrian behaviour in the presence of moving threats. Additionally, the rate of change in the angle for pedestrian motion in various desired directions is synchronized. This indicates the emergence of collective intelligence in nonreciprocal interaction crowd systems. As a result, our study may constitute a pioneering step towards understanding nonreciprocal interactions in crowd systems through laboratory experiments. These findings may enhance pedestrian safety and inform not only government crowd management strategies but also individual self-protection measures.
Submitted 2 April, 2024;
originally announced April 2024.
-
Incorporating Domain Differential Equations into Graph Convolutional Networks to Lower Generalization Discrepancy
Authors:
Yue Sun,
Chao Chen,
Yuesheng Xu,
Sihong Xie,
Rick S. Blum,
Parv Venkitasubramaniam
Abstract:
Ensuring both accuracy and robustness in time series prediction is critical to many applications, ranging from urban planning to pandemic management. With sufficient training data where all spatiotemporal patterns are well-represented, existing deep-learning models can make reasonably accurate predictions. However, existing methods fail when the training data are drawn from different circumstances (e.g., traffic patterns on regular days) compared to test data (e.g., traffic patterns after a natural disaster). Such challenges are usually classified under domain generalization. In this work, we show that one way to address this challenge in the context of spatiotemporal prediction is by incorporating domain differential equations into Graph Convolutional Networks (GCNs). We theoretically derive conditions where GCNs incorporating such domain differential equations are robust to mismatched training and testing data compared to baseline domain agnostic models. To support our theory, we propose two domain-differential-equation-informed networks called Reaction-Diffusion Graph Convolutional Network (RDGCN), which incorporates differential equations for traffic speed evolution, and Susceptible-Infectious-Recovered Graph Convolutional Network (SIRGCN), which incorporates a disease propagation model. Both RDGCN and SIRGCN are based on reliable and interpretable domain differential equations that allow the models to generalize to unseen patterns. We experimentally show that RDGCN and SIRGCN are more robust with mismatched testing data than the state-of-the-art deep learning methods.
Submitted 1 April, 2024;
originally announced April 2024.
-
Blockchain-based Pseudonym Management for Vehicle Twin Migrations in Vehicular Edge Metaverse
Authors:
Jiawen Kang,
Xiaofeng Luo,
Jiangtian Nie,
Tianhao Wu,
Haibo Zhou,
Yonghua Wang,
Dusit Niyato,
Shiwen Mao,
Shengli Xie
Abstract:
Driven by the great advances in metaverse and edge computing technologies, vehicular edge metaverses are expected to disrupt the current paradigm of intelligent transportation systems. As highly computerized avatars of Vehicular Metaverse Users (VMUs), the Vehicle Twins (VTs) deployed in edge servers can provide valuable metaverse services to improve driving safety and on-board satisfaction for their VMUs throughout journeys. To maintain uninterrupted metaverse experiences, VTs must be migrated among edge servers following the movements of vehicles. This can raise concerns about privacy breaches during the dynamic communications among vehicular edge metaverses. To address these concerns and safeguard location privacy, pseudonyms as temporary identifiers can be leveraged by both VMUs and VTs to realize anonymous communications in the physical space and virtual spaces. However, existing pseudonym management methods fall short in meeting the extensive pseudonym demands in vehicular edge metaverses, thus dramatically diminishing the performance of privacy preservation. To this end, we present a cross-metaverse empowered dual pseudonym management framework. We utilize cross-chain technology to enhance management efficiency and data security for pseudonyms. Furthermore, we propose a metric to assess the privacy level and employ a Multi-Agent Deep Reinforcement Learning (MADRL) approach to obtain an optimal pseudonym generating strategy. Numerical results demonstrate that our proposed schemes are high-efficiency and cost-effective, showcasing their promising applications in vehicular edge metaverses.
Submitted 22 March, 2024;
originally announced March 2024.
-
Robust Conformal Prediction under Distribution Shift via Physics-Informed Structural Causal Model
Authors:
Rui Xu,
Yue Sun,
Chao Chen,
Parv Venkitasubramaniam,
Sihong Xie
Abstract:
Uncertainty is critical to reliable decision-making with machine learning. Conformal prediction (CP) handles uncertainty by predicting a set on a test input, with the aim that the set covers the true label with at least $(1-α)$ confidence. This coverage can be guaranteed on test data even if the marginal distributions $P_X$ differ between calibration and test datasets. However, as is common in practice, when the conditional distribution $P_{Y|X}$ is different on calibration and test data, the coverage is not guaranteed and it is essential to measure and minimize the coverage loss under distributional shift at all possible confidence levels. To address these issues, we upper bound the coverage difference at all levels using the cumulative distribution functions of calibration and test conformal scores and Wasserstein distance. Inspired by the invariance of physics across data distributions, we propose a physics-informed structural causal model (PI-SCM) to reduce the upper bound. We validated that PI-SCM can improve coverage robustness across confidence levels and test domains on a traffic speed prediction task and an epidemic spread task with multiple real-world datasets.
Submitted 22 March, 2024;
originally announced March 2024.
-
Graph Attention Network-based Block Propagation with Optimal AoI and Reputation in Web 3.0
Authors:
Jiana Liao,
Jinbo Wen,
Jiawen Kang,
Changyan Yi,
Yang Zhang,
Yutao Jiao,
Dusit Niyato,
Dong In Kim,
Shengli Xie
Abstract:
Web 3.0 is recognized as a pioneering paradigm that empowers users to securely oversee data without reliance on a centralized authority. Blockchains, as a core technology to realize Web 3.0, can facilitate decentralized and transparent data management. Nevertheless, the evolution of blockchain-enabled Web 3.0 is still in its nascent phase, grappling with challenges such as ensuring efficiency and reliability to enhance block propagation performance. In this paper, we design a Graph Attention Network (GAT)-based reliable block propagation optimization framework for blockchain-enabled Web 3.0. We first apply a data-freshness metric called age of block to measure block propagation efficiency in public blockchains. To achieve the reliability of block propagation, we introduce a reputation mechanism based on the subjective logic model, which combines local and recommended opinions to calculate the miner reputation value. Moreover, considering that the GAT is well suited to processing graph-structured data, we utilize the GAT with reinforcement learning to obtain the optimal block propagation trajectory. Numerical results demonstrate that the proposed scheme achieves higher block propagation efficiency and reliability than traditional routing mechanisms.
Submitted 8 May, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.