-
Novel Artistic Scene-Centric Datasets for Effective Transfer Learning in Fragrant Spaces
Authors:
Shumei Liu,
Haiting Huang,
Mathias Zinnen,
Andreas Maier,
Vincent Christlein
Abstract:
Olfaction, often overlooked in cultural heritage studies, holds profound significance in shaping human experiences and identities. Examining historical depictions of olfactory scenes can offer valuable insights into the role of smells in history. We show that a transfer-learning approach using weakly labeled training data can remarkably improve the classification of fragrant spaces and, more generally, artistic scene depictions. We fine-tune Places365-pre-trained models by querying two cultural heritage data sources and using the search terms as supervision signal. The models are evaluated on two manually corrected test splits. This work lays a foundation for further exploration of fragrant spaces recognition and artistic scene classification. All images and labels are released as the ArtPlaces dataset at https://zenodo.org/doi/10.5281/zenodo.11584328.
Submitted 16 July, 2024;
originally announced July 2024.
-
ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter
Authors:
Yaoyao Qian,
Xupeng Zhu,
Ondrej Biza,
Shuo Jiang,
Linfeng Zhao,
Haojie Huang,
Yu Qi,
Robert Platt
Abstract:
Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that uses GPT-4o's advanced contextual reasoning to plan grasping strategies in heavily cluttered environments. ThinkGrasp can effectively identify and generate grasp poses for target objects, even when they are heavily obstructed or nearly invisible, by using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it in a few steps with a high success rate. In both simulated and real experiments, ThinkGrasp achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments or with diverse unseen objects, demonstrating strong generalization capabilities.
Submitted 15 July, 2024;
originally announced July 2024.
-
MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models
Authors:
Chengguang Gan,
Qingyu Yin,
Xinyang He,
Hanjun Wei,
Yunhao Liang,
Younghun Lim,
Shijian Wang,
Hexiang Huang,
Qinghao Zhang,
Shiwen Ni,
Tatsunori Mori
Abstract:
The Mutual Reinforcement Effect (MRE) represents a promising avenue in information extraction and multitasking research. Nevertheless, its applicability has been constrained due to the exclusive availability of MRE mix datasets in Japanese, thereby limiting comprehensive exploration by the global research community. To address this limitation, we introduce a Multilingual MRE mix dataset (MMM) that encompasses 21 sub-datasets in English, Japanese, and Chinese. In this paper, we also propose a method for dataset translation assisted by Large Language Models (LLMs), which significantly reduces the manual annotation time required for dataset construction by leveraging LLMs to translate the original Japanese datasets. Additionally, we have enriched the dataset by incorporating open-domain Named Entity Recognition (NER) and sentence classification tasks. Utilizing this expanded dataset, we developed a unified input-output framework to train an Open-domain Information Extraction Large Language Model (OIELLM). The OIELLM model demonstrates the capability to effectively process novel MMM datasets, exhibiting significant improvements in performance.
Submitted 15 July, 2024;
originally announced July 2024.
-
First Measurement of Solar $^8$B Neutrino Flux through Coherent Elastic Neutrino-Nucleus Scattering in PandaX-4T
Authors:
PandaX Collaboration,
Zihao Bo,
Wei Chen,
Xun Chen,
Yunhua Chen,
Zhaokan Cheng,
Xiangyi Cui,
Yingjie Fan,
Deqing Fang,
Zhixing Gao,
Lisheng Geng,
Karl Giboni,
Xunan Guo,
Xuyuan Guo,
Zichao Guo,
Chencheng Han,
Ke Han,
Changda He,
Jinrong He,
Di Huang,
Houqi Huang,
Junting Huang,
Ruquan Hou,
Yu Hou,
Xiangdong Ji
, et al. (77 additional authors not shown)
Abstract:
The PandaX-4T liquid xenon detector at the China Jinping Underground Laboratory is used to measure the solar $^8$B neutrino flux by detecting neutrinos through coherent scattering with xenon nuclei. Data samples requiring the coincidence of scintillation and ionization signals (paired), as well as unpaired ionization-only signals (US2), are selected with an energy threshold of approximately 1.1 keV (0.33 keV) nuclear recoil energy. Combining the commissioning run and the first science run of PandaX-4T, total exposures of 1.25 and 1.04 tonne$\cdot$year are collected for the paired and US2 data, respectively. After unblinding, 3 and 332 events are observed with an expectation of 2.8$\pm$0.5 and 251$\pm$32 background events, for the paired and US2 data, respectively. A combined analysis yields a best-fit $^8$B neutrino signal of 3.5 (75) events from the paired (US2) data sample, with $\sim$37\% uncertainty, and the background-only hypothesis is disfavored at 2.64$\sigma$ significance. This gives a solar $^8$B neutrino flux of ($8.4\pm3.1$)$\times$10$^6$ cm$^{-2}$s$^{-1}$, consistent with the standard solar model prediction. This is the first indication of solar $^8$B neutrino ``fog'' in a dark matter direct detection experiment.
Submitted 15 July, 2024;
originally announced July 2024.
-
FRI-Net: Floorplan Reconstruction via Room-wise Implicit Representation
Authors:
Honghao Xu,
Juzhan Xu,
Zeyu Huang,
Pengfei Xu,
Hui Huang,
Ruizhen Hu
Abstract:
In this paper, we introduce a novel method called FRI-Net for 2D floorplan reconstruction from 3D point clouds. Existing methods typically rely on corner regression or box regression, which do not consider the global shapes of rooms. To address this issue, we propose a novel approach using a room-wise implicit representation with structural regularization to characterize the shapes of rooms in floorplans. By incorporating geometric priors of room layouts in floorplans into our training strategy, the generated room polygons are more geometrically regular. We have conducted experiments on two challenging datasets, Structured3D and SceneCAD. Our method demonstrates improved performance compared to state-of-the-art methods, validating the effectiveness of our proposed representation for floorplan reconstruction.
Submitted 15 July, 2024;
originally announced July 2024.
-
Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction
Authors:
Lin Zhu,
Yunlong Zheng,
Yijun Zhang,
Xiao Wang,
Lizhi Wang,
Hua Huang
Abstract:
Event-based video reconstruction has garnered increasing attention due to its advantages, such as high dynamic range and rapid motion capture capabilities. However, current methods often prioritize the extraction of temporal information from continuous event flow, leading to an overemphasis on low-frequency texture features in the scene, resulting in over-smoothing and blurry artifacts. Addressing this challenge necessitates the integration of conditional information, encompassing temporal features, low-frequency texture, and high-frequency events, to guide the Denoising Diffusion Probabilistic Model (DDPM) in producing accurate and natural outputs. To tackle this issue, we introduce a novel approach, the Temporal Residual Guided Diffusion Framework, which effectively leverages both temporal and frequency-based event priors. Our framework incorporates three key conditioning modules: a pre-trained low-frequency intensity estimation module, a temporal recurrent encoder module, and an attention-based high-frequency prior enhancement module. In order to capture temporal scene variations from the events at the current moment, we employ a temporal-domain residual image as the target for the diffusion model. Through the combination of these three conditioning paths and the temporal residual framework, our framework excels in reconstructing high-quality videos from event flow, mitigating issues such as artifacts and over-smoothing commonly observed in previous approaches. Extensive experiments conducted on multiple benchmark datasets validate the superior performance of our framework compared to prior event-based reconstruction methods.
Submitted 15 July, 2024;
originally announced July 2024.
-
Learning to Unlearn for Robust Machine Unlearning
Authors:
Mark He Huang,
Lin Geng Foo,
Jun Liu
Abstract:
Machine unlearning (MU) seeks to remove knowledge of specific data samples from trained models without the necessity for complete retraining, a task made challenging by the dual objectives of effective erasure of data and maintaining the overall performance of the model. Despite recent advances in this field, balancing between the dual objectives of unlearning remains challenging. From a fresh perspective of generalization, we introduce a novel Learning-to-Unlearn (LTU) framework, which adopts a meta-learning approach to optimize the unlearning process to improve forgetting and remembering in a unified manner. LTU includes a meta-optimization scheme that facilitates models to effectively preserve generalizable knowledge with only a small subset of the remaining set, while thoroughly forgetting the specific data samples. We also introduce a Gradient Harmonization strategy to align the optimization trajectories for remembering and forgetting via mitigating gradient conflicts, thus ensuring efficient and effective model updates. Our approach demonstrates improved efficiency and efficacy for MU, offering a promising solution to the challenges of data rights and model reusability.
Submitted 15 July, 2024;
originally announced July 2024.
-
The STAR Forward Silicon Tracker
Authors:
J. D. Brandenburg,
Y. Chang,
J. Dong,
Y. He,
Y. Hu,
H. Huang,
T. Huang,
H. Li,
M. Nie,
R. Sharma,
X. Sun,
P. Tribedy,
F. Videbæk,
G. Visser,
G. Wilks,
P. Wang,
G. Xie,
G. Yan,
Z. Ye,
L. Yi,
Y. Yang,
S. Zhang,
Z. Zhang
Abstract:
The Forward Silicon Tracker (FST) is a pivotal component of the forward upgrade of the Solenoidal Tracker at RHIC (STAR), designed to discern hadron charge signs with a momentum resolution better than 30\% for $0.2 < p_T < 2$ GeV/c in the $2.5 < \eta < 4$ pseudorapidity range. Its compact design features three disks along the beam direction, minimizing material budget and scattering effects. The FST uses Hamamatsu's p-in-n silicon strip sensors with a double metal layer for efficient signal processing. The flexible hybrid boards, essential for the readout system, are constructed with Kapton and copper layers to optimize signal handling and power distribution. These boards connect silicon strips to analogue pipeline ASIC APV25-S1 chips, which read up to 128 channels each. A cooling system with nonconducting, volatile NOVEC 7200 coolant at 22.2°C mitigates ASIC-generated heat. The FST enhances forward tracking performance at RHIC, showcasing unique design solutions to complex challenges.
Submitted 13 July, 2024;
originally announced July 2024.
-
A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations
Authors:
Xiangzhu Kong,
Tianqi Ning,
Hao Huang,
Zhijian Ou
Abstract:
Recently, multi-channel end-to-end (ME2E) ASR systems have emerged. While streaming single-channel end-to-end ASR has been extensively studied, streaming ME2E ASR has seen limited exploration. Additionally, recent studies call attention to the gap between in-distribution (ID) and out-of-distribution (OOD) tests and to the need for realistic evaluations. This paper focuses on two research problems: realizing streaming ME2E ASR and improving OOD generalization. We propose the CUSIDE-array method, which integrates the recent CUSIDE methodology (Chunking, Simulating Future Context and Decoding) into the neural beamformer approach of ME2E ASR. It enables streaming processing of both front-end and back-end with a total latency of 402 ms. The CUSIDE-array ME2E models are shown to achieve superior streaming results in both ID and OOD tests. Realistic evaluations confirm the advantage of CUSIDE-array in its capability to consume single-channel data to improve OOD generalization via back-end pre-training and ME2E fine-tuning.
Submitted 13 July, 2024;
originally announced July 2024.
-
Brain Dialogue Interface (BDI): A User-Friendly fMRI Model for Interactive Brain Decoding
Authors:
Heng Huang,
Lin Zhao,
Zihao Wu,
Xiaowei Yu,
Jing Zhang,
Xintao Hu,
Dajiang Zhu,
Tianming Liu
Abstract:
Brain decoding techniques are essential for understanding the neurocognitive system. Although numerous methods have been introduced in this field, accurately aligning complex external stimuli with brain activities remains a formidable challenge. To alleviate alignment difficulties, many studies have simplified their models by employing single-task paradigms and establishing direct links between brain/world through classification strategies. Despite improvements in decoding accuracy, this strategy frequently encounters issues with generality when adapting these models to various task paradigms. To address this issue, this study introduces a user-friendly decoding model that enables dynamic communication with the brain, as opposed to the static decoding approaches utilized by traditional studies. The model functions as a brain simulator, allowing for interactive engagement with the brain and enabling the decoding of a subject's experiences through dialogue-like queries. Uniquely, our model is trained in a completely unsupervised and task-free manner. Our experiments demonstrate the feasibility and versatility of our proposed method. Notably, our model demonstrates exceptional capabilities in signal compression, successfully representing the entire brain signal of approximately 185,751 voxels with just 32 signals. Furthermore, we show how our model can integrate seamlessly with multimodal models, thus enhancing the potential for controlling brain decoding through textual or image inputs.
Submitted 17 June, 2024;
originally announced July 2024.
-
Beyond Image Prior: Embedding Noise Prior into Conditional Denoising Transformer
Authors:
Yuanfei Huang,
Hua Huang
Abstract:
Existing learning-based denoising methods typically train models to generalize the image prior from large-scale datasets, suffering from the variability in noise distributions encountered in real-world scenarios. In this work, we propose a new perspective on the denoising challenge by highlighting the distinct separation between noise and image priors. This insight forms the basis for our development of a conditional optimization framework, designed to overcome the constraints of traditional denoising frameworks. To this end, we introduce a Locally Noise Prior Estimation (LoNPE) algorithm, which accurately estimates the noise prior directly from a single raw noisy image. This estimation acts as an explicit prior representation of the camera sensor's imaging environment, distinct from the image prior of scenes. Additionally, we design an auxiliary learnable LoNPE network tailored for practical application to sRGB noisy images. Leveraging the estimated noise prior, we present a novel Conditional Denoising Transformer (Condformer) that incorporates the noise prior into a conditional self-attention mechanism. This integration allows the Condformer to segment the optimization process into multiple explicit subspaces, significantly enhancing the model's generalization and flexibility. Extensive experimental evaluations on both synthetic and real-world datasets demonstrate that the proposed method achieves superior performance over current state-of-the-art methods. The source code is available at https://github.com/YuanfeiHuang/Condformer.
Submitted 12 July, 2024;
originally announced July 2024.
-
Accurate Prior-centric Monocular Positioning with Offline LiDAR Fusion
Authors:
Jinhao He,
Huaiyang Huang,
Shuyang Zhang,
Jianhao Jiao,
Chengju Liu,
Ming Liu
Abstract:
Unmanned vehicles usually rely on Global Positioning System (GPS) and Light Detection and Ranging (LiDAR) sensors to achieve high-precision localization results for navigation purposes. However, this combination, with its associated costs and infrastructure demands, poses challenges for widespread adoption in mass-market applications. In this paper, we aim to use only a monocular camera to achieve comparable onboard localization performance by tracking deep-learning visual features on a LiDAR-enhanced visual prior map. Experiments show that the proposed algorithm can provide centimeter-level global positioning results with scale, and that it is easy to integrate and well suited to low-cost robot system deployment in real-world applications.
Submitted 12 July, 2024;
originally announced July 2024.
-
FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification
Authors:
Yu Tian,
Congcong Wen,
Min Shi,
Muhammad Muneeb Afzal,
Hao Huang,
Muhammad Osama Khan,
Yan Luo,
Yi Fang,
Mengyu Wang
Abstract:
Addressing fairness in artificial intelligence (AI), particularly in medical AI, is crucial for ensuring equitable healthcare outcomes. Recent efforts to enhance fairness have introduced new methodologies and datasets in medical AI. However, the fairness issue under the setting of domain transfer is almost unexplored, even though clinics commonly rely on different imaging technologies (e.g., different retinal imaging modalities) for patient diagnosis. This paper presents FairDomain, a pioneering systemic study into algorithmic fairness under domain shifts, employing state-of-the-art domain adaptation (DA) and generalization (DG) algorithms for both medical segmentation and classification tasks to understand how biases are transferred between different domains. We also introduce a novel plug-and-play fair identity attention (FIA) module that adapts to various DA and DG algorithms to improve fairness by using self-attention to adjust feature importance based on demographic attributes. Additionally, we curate the first fairness-focused dataset with two paired imaging modalities for the same patient cohort on medical segmentation and classification tasks, to rigorously assess fairness in domain-shift scenarios. Excluding the confounding impact of demographic distribution variation between source and target domains will allow clearer quantification of the performance of domain transfer models. Our extensive evaluations reveal that the proposed FIA significantly enhances model performance, accounting for fairness, across all domain shift settings (i.e., DA and DG) with respect to different demographics, outperforming existing methods on both segmentation and classification. The code and data can be accessed at https://ophai.hms.harvard.edu/datasets/harvard-fairdomain20k.
Submitted 11 July, 2024;
originally announced July 2024.
-
Spin-valley-locked Electroluminescence for High-Performance Circularly-Polarized Organic Light-Emitting Diodes
Authors:
Yibo Deng,
Teng Long,
Pingyang Wang,
Han Huang,
Zijian Deng,
Chunling Gu,
Cunbin An,
Bo Liao,
Guillaume Malpuech,
Dmitry Solnyshkov,
Hongbing Fu,
Qing Liao
Abstract:
Circularly polarized (CP) organic light-emitting diodes (OLEDs) have attracted attention in potential applications including novel display and photonic technologies. However, conventional approaches cannot meet the requirements of device performance, such as high dissymmetry factor, high directionality, narrowband emission, simplified device structure and low costs. Here, we demonstrate spin-valley-locked CP-OLEDs without chiral emitters, but based on photonic spin-orbit coupling, where photons with opposite CP characteristics are emitted from different optical valleys. These spin-valley locked OLEDs exhibit a narrowband emission of 16 nm, a high EQE of 3.65, a maximum luminance of near 98000 cd/m2 and a gEL of up to 1.80, which are among the best performances of active single-crystal CP-OLEDs, achieved with a simple device structure. This strategy opens an avenue for practical applications towards three-dimensional displays and on-chip CP-OLEDs.
Submitted 11 July, 2024;
originally announced July 2024.
-
A Trustworthy AIoT-enabled Localization System via Federated Learning and Blockchain
Authors:
Junfei Wang,
He Huang,
Jingze Feng,
Steven Wong,
Lihua Xie,
Jianfei Yang
Abstract:
There is a significant demand for indoor localization technology in smart buildings, and the most promising solution in this field is using RF sensors and fingerprinting-based methods that employ machine learning models trained on crowd-sourced user data gathered from IoT devices. However, this raises security and privacy issues in practice. Some researchers propose to use federated learning to partially overcome privacy problems, but security concerns remain, e.g., single-point failure and malicious attacks. In this paper, we propose a framework named DFLoc to achieve precise 3D localization tasks while addressing these two security concerns. In particular, we design a specialized blockchain to decentralize the framework by distributing tasks such as model distribution and aggregation, which are handled by a central server in most previous works, to all clients, addressing the issue of single-point failure for a reliable and accurate indoor localization system. Moreover, we introduce an updated model verification mechanism within the blockchain to alleviate the concern of malicious node attacks. Experimental results substantiate the framework's capacity to deliver accurate 3D location predictions and its superior resistance to the impacts of single-point failure and malicious attacks when compared to conventional centralized federated learning systems.
Submitted 8 July, 2024;
originally announced July 2024.
-
Random unitaries in extremely low depth
Authors:
Thomas Schuster,
Jonas Haferkamp,
Hsin-Yuan Huang
Abstract:
We prove that random quantum circuits on any geometry, including a 1D line, can form approximate unitary designs over $n$ qubits in $\log n$ depth. In a similar manner, we construct pseudorandom unitaries (PRUs) in 1D circuits in $\text{poly} \log n $ depth, and in all-to-all-connected circuits in $\text{poly} \log \log n $ depth. In all three cases, the $n$ dependence is optimal and improves exponentially over known results. These shallow quantum circuits have low complexity and create only short-range entanglement, yet are indistinguishable from unitaries with exponential complexity. Our construction glues local random unitaries on $\log n$-sized or $\text{poly} \log n$-sized patches of qubits to form a global random unitary on all $n$ qubits. In the case of designs, the local unitaries are drawn from existing constructions of approximate unitary $k$-designs, and hence also inherit an optimal scaling in $k$. In the case of PRUs, the local unitaries are drawn from existing unitary ensembles conjectured to form PRUs. Applications of our results include proving that classical shadows with 1D log-depth Clifford circuits are as powerful as those with deep circuits, demonstrating superpolynomial quantum advantage in learning low-complexity physical systems, and establishing quantum hardness for recognizing phases of matter with topological order.
Submitted 10 July, 2024;
originally announced July 2024.
-
NoisyAG-News: A Benchmark for Addressing Instance-Dependent Noise in Text Classification
Authors:
Hongfei Huang,
Tingting Liang,
Xixi Sun,
Zikang Jin,
Yuyu Yin
Abstract:
Existing research on learning with noisy labels predominantly focuses on synthetic label noise. Although synthetic noise possesses well-defined structural properties, it often fails to accurately replicate real-world noise patterns. In recent years, there has been a concerted effort to construct generalizable and controllable instance-dependent noise datasets for image classification, significantly advancing the development of noise-robust learning in this area. However, studies on noisy label learning for text classification remain scarce. To better understand label noise in real-world text classification settings, we constructed the benchmark dataset NoisyAG-News through manual annotation. Initially, we analyzed the annotated data to gather observations about real-world noise. We qualitatively and quantitatively demonstrated that real-world noisy labels adhere to instance-dependent patterns. Subsequently, we conducted comprehensive learning experiments on NoisyAG-News and its corresponding synthetic noise datasets using pre-trained language models and noise-handling techniques. Our findings reveal that while pre-trained models are resilient to synthetic noise, they struggle against instance-dependent noise, with samples of varying confusion levels showing inconsistent performance during training and testing. These real-world noise patterns pose new, significant challenges, prompting a reevaluation of noisy label handling methods. We hope that NoisyAG-News will facilitate the development and evaluation of future solutions for learning with noisy labels.
Submitted 9 July, 2024;
originally announced July 2024.
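The contrast drawn in this abstract between synthetic and instance-dependent noise can be made concrete with a small sketch of the synthetic side. The code below injects symmetric (instance-independent) label noise: every label is flipped with the same probability, regardless of the sample's content. The function name `add_symmetric_noise`, its parameters, and the toy label list are hypothetical and are not taken from the NoisyAG-News paper; the choice of four classes is an assumption that matches the four topic categories of AG News.

```python
import random


def add_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label to a different class with probability `noise_rate`.

    This is instance-independent noise: the flip decision ignores the
    sample itself, unlike the real-world noise the abstract describes.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes to flip labels")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # Pick uniformly among the classes other than the true one.
            alternatives = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(y)
    return noisy


# Hypothetical toy labels for a four-class problem.
clean = [0, 1, 2, 3] * 25
noisy = add_symmetric_noise(clean, num_classes=4, noise_rate=0.2)
flipped = sum(a != b for a, b in zip(clean, noisy))
print(f"{flipped} of {len(clean)} labels flipped")
```

Because the flip probability here is constant across samples, a model can often average the noise away; instance-dependent noise concentrates on ambiguous samples, which is the harder setting the abstract reports.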
-
Identifying \textit{doppelgänge} Black Holes through Shadow Images
Authors:
Yukun Xu,
Hyat Huang,
Meng-Yun Lai,
De-Cheng Zou
Abstract:
Recently, an interesting \textit{doppelgänge} black hole solution was obtained in the string-inspired Euler-Heisenberg theory, in which the black holes have the same radii but different charges. We find, however, that they possess different ISCOs and photon spheres, which affects their shadow images. In this work, we investigate the optical appearances of such black holes when illuminated by an optically and geometrically thin disk. We find that doppelgänge black holes have different optical appearances: even though the horizon radii are the same, the sizes of the shadows are not equal. Furthermore, black holes with large magnetic charge $Q_m$ give rise to novel shadow images in which the usual bright rings inside the shadow are not clear. The optical appearances under spherical accretion are also examined, and they can likewise distinguish two doppelgänge black holes.
Submitted 9 July, 2024;
originally announced July 2024.
-
Particle-In-Cell simulations of filamentation process in magnetized plasma of capacitively-coupled radio-frequency discharge
Authors:
Huidong Huang,
Jian Chen,
Zhibin Wang
Abstract:
In the uniform radio-frequency capacitively-coupled plasma (RF-CCP) between a large electrode pair, adding an axial magnetic field induces diverse longitudinal filaments. This phenomenon, termed 'filamentation', challenges conventional understanding and remains poorly understood to date. To reveal its pattern dynamics, we conduct 2D Particle-In-Cell simulations to comprehensively examine the whole process of filamentation, identifying two distinct stages. Initially, standing waves grow with a modulational instability, forming regular filaments. Subsequently, when the initial wavenumber matching relation breaks, the plasma shifts towards a dynamic regime governed by competition between Lorentz and thermal pressure forces, characterized by the filaments' chaotic evolution. These novel clues pave the way to theoretically understanding the filamentation instability and provide essential references for effectively manipulating magnetized plasmas.
Submitted 8 July, 2024;
originally announced July 2024.
-
Probing the nature of the anticharmed-strange pentaquark states: mass spectra, decays, and magnetic moments
Authors:
Xuejie Liu,
Yue Tan,
Xiaoyun Chen,
Dianyong Chen,
Hongxia Huang,
Jialun Ping
Abstract:
Within the framework of the quark delocalization color screening model, a systematic investigation of the anticharmed-strange pentaquark system is performed using the resonance group method. The current estimates predict three bound states with masses of 2886 MeV, 3039 MeV, and 3153 MeV, respectively. Additionally, three resonance states are identified in various scattering phase-shift processes. Among them, two resonance states $ΣD$ and $Σ^{\ast}D^{\ast}$ with quantum number $\frac{1}{2}(\frac{1}{2}^{-})$ are detected in channels $ND_{s}^{\ast}$ and $ND$, and $ΣD^{\ast}$ and $ΛD$, with masses and decay widths of ($M_{R}=3053\sim3055$ MeV, $T_{total}=13.0\sim13.4$ MeV) and ($M_{R}=3389\sim3390$ MeV, $T_{total}=10.4$ MeV), respectively. In the $ΛD^{\ast}$ and $ΣD^{\ast}$ channels, a resonance state with quantum number $\frac{1}{2}(\frac{3}{2}^{-})$ is discovered, with its mass and decay width being $3250\sim3252$ MeV and 4.4 MeV, respectively. These predicted pentaquark states have $\bar{c}snnn$ quark compositions, allowing them to be recognized as genuine pentaquark states. To validate these predictions, it is expected that upcoming experiments will further explore the predicted resonance and bound states in these possible decay channels.
Submitted 8 July, 2024;
originally announced July 2024.
-
Fréchet Distance in Subquadratic Time
Authors:
Siu-Wing Cheng,
Haoqiang Huang
Abstract:
Let $m$ and $n$ be the numbers of vertices of two polygonal curves in $\mathbb{R}^d$ for any fixed $d$ such that $m \leq n$. Since it was known in 1995 how to compute the Fréchet distance of these two curves in $O(mn\log (mn))$ time, it has been an open problem whether the running time can be reduced to $o(n^2)$ when $m = Ω(n)$. In the meantime, several well-known quadratic time barriers in computational geometry have been overcome: 3SUM, some 3SUM-hard problems, and the computation of some distances between two polygonal curves, including the discrete Fréchet distance, the dynamic time warping distance, and the geometric edit distance. It is curious that the quadratic time barrier for Fréchet distance still stands. We present an algorithm to compute the Fréchet distance in $O(mn(\log\log n)^{2+μ}\log n/\log^{1+μ} m)$ expected time for some constant $μ\in (0,1)$. It is the first algorithm that returns the Fréchet distance in $o(mn)$ time when $m = Ω(n^{\varepsilon})$ for any fixed $\varepsilon \in (0,1]$.
Submitted 6 July, 2024;
originally announced July 2024.
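For readers unfamiliar with the quadratic barrier discussed in this abstract, the sketch below computes the discrete Fréchet distance, one of the related curve distances the abstract names, with the textbook $O(mn)$ dynamic program. It is not the subquadratic algorithm of the paper and not the continuous Fréchet distance; it only illustrates the kind of $m \times n$ table whose size gives rise to the barrier. The function name `discrete_frechet` and the sample curves are hypothetical.

```python
from math import dist


def discrete_frechet(p, q):
    """Discrete Fréchet distance between vertex sequences p and q.

    Classic O(m*n) dynamic program over vertex pairs, kept to two rows
    of the table. Each entry is the smallest possible "leash length"
    needed to walk both prefixes monotonically.
    """
    m, n = len(p), len(q)
    if m == 0 or n == 0:
        raise ValueError("both curves need at least one vertex")
    prev = [0.0] * n
    for i in range(m):
        cur = [0.0] * n
        for j in range(n):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                cur[j] = d
            elif i == 0:
                cur[j] = max(cur[j - 1], d)   # only q has advanced
            elif j == 0:
                cur[j] = max(prev[j], d)      # only p has advanced
            else:
                # Best of: advance p, advance both, advance q.
                cur[j] = max(min(prev[j], prev[j - 1], cur[j - 1]), d)
        prev = cur
    return prev[-1]


# Two parallel horizontal segments one unit apart (hypothetical data).
curve_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
curve_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(discrete_frechet(curve_a, curve_b))
```

The double loop touches every vertex pair once, which is exactly the $mn$ factor that subquadratic algorithms must avoid filling in entry by entry.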
-
Volume-optimal persistence homological scaffolds of hemodynamic networks covary with MEG theta-alpha aperiodic dynamics
Authors:
Nghi Nguyen,
Tao Hou,
Enrico Amico,
Jingyi Zheng,
Huajun Huang,
Alan D. Kaplan,
Giovanni Petri,
Joaquín Goñi,
Yize Zhao,
Duy Duong-Tran,
Li Shen
Abstract:
Higher-order properties of functional magnetic resonance imaging (fMRI) induced connectivity have been shown to unravel many exclusive topological and dynamical insights beyond pairwise interactions. Nonetheless, whether these fMRI-induced higher-order properties play a role in disentangling other neuroimaging modalities' insights remains largely unexplored and poorly understood. In this work, by analyzing fMRI data from the Human Connectome Project Young Adult dataset using persistent homology, we discovered that the volume-optimal persistence homological scaffolds of fMRI-based functional connectomes exhibited conservative topological reconfigurations from the resting state to attentional task-positive state. Specifically, while reflecting the extent to which each cortical region contributed to functional cycles following different cognitive demands, these reconfigurations were constrained such that the spatial distribution of cavities in the connectome is relatively conserved. Most importantly, such level of contributions covaried with powers of aperiodic activities mostly within the theta-alpha (4-12 Hz) band measured by magnetoencephalography (MEG). This comprehensive result suggests that fMRI-induced hemodynamics and MEG theta-alpha aperiodic activities are governed by the same functional constraints specific to each cortical morpho-structure. Methodologically, our work paves the way toward an innovative computing paradigm in multimodal neuroimaging topological learning.
Submitted 6 July, 2024;
originally announced July 2024.
-
The rigorous derivation of Vlasov equations with local alignments from moderately interacting particle systems
Authors:
Jinhuan Wang,
Mengdi Zhuang,
Hui Huang
Abstract:
In this paper, we present a rigorous derivation of the mean-field limit for a moderately interacting particle system in $\mathbb{R}^d$ $(d\geq 2)$. For stochastic initial data, we demonstrate that the solution to the interacting particle model, with an appropriately applied cut-off, converges in a probabilistic sense to the solution of the characteristics of the regularized Vlasov models featuring local alignments and Newtonian potential. Notably, the cut-off parameter for the singular potential is selected to scale polynomially with the number of particles, representing an improvement over the logarithmic cut-off obtained in [38].
Submitted 5 July, 2024;
originally announced July 2024.
-
Benchmarking Complex Instruction-Following with Multiple Constraints Composition
Authors:
Bosi Wen,
Pei Ke,
Xiaotao Gu,
Lindong Wu,
Hao Huang,
Jinfeng Zhou,
Wenchuang Li,
Binxin Hu,
Wendy Gao,
Jiaxin Xu,
Yiming Liu,
Jie Tang,
Hongning Wang,
Minlie Huang
Abstract:
Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, which is an indispensable constituent in complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types. ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.
Submitted 11 July, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.
-
GriDB: Scaling Blockchain Database via Sharding and Off-Chain Cross-Shard Mechanism
Authors:
Zicong Hong,
Song Guo,
Enyuan Zhou,
Wuhui Chen,
Huawei Huang,
Albert Zomaya
Abstract:
Blockchain databases have attracted widespread attention but suffer from poor scalability due to underlying non-scalable blockchains. While blockchain sharding is necessary for a scalable blockchain database, it poses a new challenge named on-chain cross-shard database services. Each cross-shard database service (e.g., cross-shard queries or inter-shard load balancing) involves massive cross-shard data exchanges, while the existing cross-shard mechanisms need to process each cross-shard data exchange via the consensus of all nodes in the related shards (i.e., on-chain) to resist a Byzantine environment of blockchain, which eliminates sharding benefits. To tackle the challenge, this paper presents GriDB, the first scalable blockchain database, by designing a novel off-chain cross-shard mechanism for efficient cross-shard database services. Borrowing the idea of off-chain payments, GriDB delegates massive cross-shard data exchange to a few nodes, each of which is randomly picked from a different shard. Considering the Byzantine environment, the untrusted delegates cooperate to generate succinct proof for cross-shard data exchanges, while the consensus is only responsible for the low-cost proof verification. However, different from payments, the database services' verification has more requirements (e.g., completeness, correctness, freshness, and availability); thus, we introduce several new authenticated data structures (ADS). Particularly, we utilize consensus to extend the threat model and reduce the complexity of traditional accumulator-based ADS for verifiable cross-shard queries with a rich set of relational operators. Moreover, we study the necessity of inter-shard load balancing for a scalable blockchain database and design an off-chain and live approach for both efficiency and availability during balancing.
Submitted 4 July, 2024;
originally announced July 2024.
-
High-temperature Superconductivity in Perovskite Hydride below 10 GPa
Authors:
Mingyang Du,
Hongyu Huang,
Zihan Zhang,
Min Wang,
Hao Song,
Defang Duan,
Tian Cui
Abstract:
Hydrogen and hydride materials have long been considered promising candidates for high-temperature superconductivity. However, the extreme pressures required for the metallization of hydrogen-based superconductors limit their applications. Here, we have designed a series of high-temperature perovskite hydrides that can be stable within 10 GPa. Our research covered 182 ternary systems and ultimately determined that 9 compounds were stable within 20 GPa, of which 5 exhibited superconducting transition temperatures exceeding 120 K within 10 GPa. Excitingly, KGaH3 and CsInH3 are thermodynamically stable at 50 GPa. Among these perovskite hydrides, alkali metals are responsible for providing a fixed amount of charge and maintaining structural stability, while the cubic framework formed by IIIA group elements and hydrogen is crucial for high-temperature superconductivity. This work will inspire further experimental exploration and take an important step in the exploration of low-pressure stable high-temperature superconductors.
Submitted 3 July, 2024;
originally announced July 2024.
-
OrbitGrasp: $SE(3)$-Equivariant Grasp Learning
Authors:
Boce Hu,
Xupeng Zhu,
Dian Wang,
Zihao Dong,
Haojie Huang,
Chenghao Wang,
Robin Walters,
Robert Platt
Abstract:
While grasp detection is an important part of any robotic manipulation pipeline, reliable and accurate grasp detection in $SE(3)$ remains a research challenge. Many robotics applications in unstructured environments such as the home or warehouse would benefit a lot from better grasp performance. This paper proposes a novel framework for detecting $SE(3)$ grasp poses based on point cloud input. Our main contribution is to propose an $SE(3)$-equivariant model that maps each point in the cloud to a continuous grasp quality function over the 2-sphere $S^2$ using a spherical harmonic basis. Compared with reasoning about a finite set of samples, this formulation improves the accuracy and efficiency of our model when a large number of samples would otherwise be needed. In order to accomplish this, we propose a novel variation on EquiFormerV2 that leverages a UNet-style backbone to enlarge the number of points the model can handle. Our resulting method, which we name $\textit{OrbitGrasp}$, significantly outperforms baselines in both simulation and physical experiments.
Submitted 3 July, 2024;
originally announced July 2024.
-
Properties of the QCD Matter -- An Experimental Review of Selected Results from RHIC BES Program
Authors:
Jinhui Chen,
Xin Dong,
Xionghong He,
Huanzhong Huang,
Feng Liu,
Xiaofeng Luo,
Yu-Gang Ma,
Lijuan Ruan,
Ming Shao,
Shusu Shi,
Xu Sun,
Aihong Tang,
Zebo Tang,
Fuqiang Wang,
Hai Wang,
Yi Wang,
Zhigang Xiao,
Guannan Xie,
Nu Xu,
Qinghua Xu,
Zhangbu Xu,
Chi Yang,
Shuai Yang,
Wangmei Zha,
Yapeng Zhang
, et al. (3 additional authors not shown)
Abstract:
In the paper, we discuss the development of the multi-gap resistive plate chamber Time-of-Flight (TOF) technology and the production of the STAR TOF detector in China at the beginning of the 21st century. Then we review recent experimental results from the first beam energy scan program (BES-I) at the Relativistic Heavy Ion Collider (RHIC). Topics cover measurements of collectivity, chirality, criticality, global polarization, strangeness, heavy-flavor, di-lepton and light nuclei productions.
Submitted 3 July, 2024;
originally announced July 2024.
-
Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval
Authors:
Jiexin Wang,
Xitong Luo,
Liuwen Cao,
Hongkui He,
Hailin Huang,
Jiayuan Xie,
Adam Jatowt,
Yi Cai
Abstract:
Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, their training using unsanitized data from open-source repositories, like GitHub, raises the risk of inadvertently propagating security vulnerabilities. Despite numerous studies investigating the safety of code LLMs, there remains a gap in comprehensively addressing their security features. In this work, we present a comprehensive study aimed at precisely evaluating and enhancing the security aspects of code LLMs. To support our research, we introduce CodeSecEval, a meticulously curated dataset designed to address 44 critical vulnerability types with 180 distinct samples. CodeSecEval serves as the foundation for the automatic evaluation of code models in two crucial tasks: code generation and code repair, with a strong emphasis on security. Our experimental results reveal that current models frequently overlook security issues during both code generation and repair processes, resulting in the creation of vulnerable code. In response, we propose different strategies that leverage vulnerability-aware information and insecure code explanations to mitigate these security vulnerabilities. Furthermore, our findings highlight that certain vulnerability types particularly challenge model performance, influencing their effectiveness in real-world applications. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring the development of improved methods for training and utilizing LLMs, thereby leading to safer and more trustworthy model deployment.
Submitted 4 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs
Authors:
Haodong Chen,
Haojian Huang,
Junhao Dong,
Mingzhe Zheng,
Dian Shao
Abstract:
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance mainly due to the scarcity of high-quality data, the insufficient utilization of facial dynamics, and the ambiguity of expression semantics, etc. To this end, we propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER), incorporating the following novel designs: 1) To better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by calculating the cross-modal similarity based on the CLIP model; 2) Our FineCLIPER adopts a hierarchical manner to effectively mine useful cues from DFE videos. Specifically, besides directly embedding video frames as input (low semantic level), we propose to extract the face segmentation masks and landmarks based on each frame (middle semantic level) and utilize the Multi-modal Large Language Model (MLLM) to further generate detailed descriptions of facial changes across frames with designed prompts (high semantic level). Additionally, we also adopt Parameter-Efficient Fine-Tuning (PEFT) to enable efficient adaptation of large pre-trained models (i.e., CLIP) for this task. Our FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings with few tunable parameters. Analysis and ablation studies further validate its effectiveness.
Submitted 2 July, 2024;
originally announced July 2024.
-
SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack
Authors:
Yan Yang,
Zeguan Xiao,
Xin Lu,
Hongru Wang,
Hailiang Huang,
Guanhua Chen,
Yun Chen
Abstract:
The widespread applications of large language models (LLMs) have brought about concerns regarding their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework to design jailbreak prompts automatically. Inspired by the social facilitation concept, SoP generates and optimizes multiple jailbreak characters to bypass the guardrails of the target LLM. Different from previous work which relies on proprietary LLMs or seed jailbreak templates crafted by human expertise, SoP can generate and optimize the jailbreak prompt in a cold-start scenario using open-sourced LLMs without any seed jailbreak templates. Experimental results show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, while also exploring defense strategies against the jailbreak attack designed by SoP. Code is available at https://github.com/Yang-Yan-Yang-Yan/SoP.
Submitted 1 July, 2024;
originally announced July 2024.
-
Core Knowledge Learning Framework for Graph Adaptation and Scalability Learning
Authors:
Bowen Zhang,
Zhichao Huang,
Genan Dai,
Guangning Xu,
Xiaomao Fan,
Hu Huang
Abstract:
Graph classification is a pivotal challenge in machine learning, especially within the realm of graph-based data, given its importance in numerous real-world applications such as social network analysis, recommendation systems, and bioinformatics. Despite its significance, graph classification faces several hurdles, including adapting to diverse prediction tasks, training across multiple target domains, and handling small-sample prediction scenarios. Current methods often tackle these challenges individually, leading to fragmented solutions that lack a holistic approach to the overarching problem. In this paper, we propose an algorithm aimed at addressing the aforementioned challenges. By incorporating insights from various types of tasks, our method aims to enhance adaptability, scalability, and generalizability in graph classification. Motivated by the recognition that the underlying subgraph plays a crucial role in GNN prediction, while the remainder is task-irrelevant, we introduce the Core Knowledge Learning framework for graph adaptation and scalability learning. The framework comprises several key modules, including the core subgraph knowledge submodule, graph domain adaptation module, and few-shot learning module for downstream tasks. Each module is tailored to tackle specific challenges in graph classification, such as domain shift, label inconsistencies, and data scarcity. By learning the core subgraph of the entire graph, we focus on the most pertinent features for task relevance. Consequently, our method offers benefits such as improved model performance, increased domain adaptability, and enhanced robustness to domain variations. Experimental results demonstrate significant performance enhancements achieved by our method compared to state-of-the-art approaches.
Submitted 1 July, 2024;
originally announced July 2024.
-
Equivariant Diffusion Policy
Authors:
Dian Wang,
Stephen Hart,
David Surovik,
Tarik Kelestemur,
Haojie Huang,
Haibo Zhao,
Mark Yeatman,
Jiuguang Wang,
Robin Walters,
Robert Platt
Abstract:
Recent work has shown diffusion models are an effective approach to learning the multimodal distributions arising from demonstration data in behavior cloning. However, a drawback of this approach is the need to learn a denoising function, which is significantly more complex than learning an explicit policy. In this work, we propose Equivariant Diffusion Policy, a novel diffusion policy learning method that leverages domain symmetries to obtain better sample efficiency and generalization in the denoising function. We theoretically analyze the $\mathrm{SO}(2)$ symmetry of full 6-DoF control and characterize when a diffusion model is $\mathrm{SO}(2)$-equivariant. We furthermore evaluate the method empirically on a set of 12 simulation tasks in MimicGen, and show that it obtains a success rate that is, on average, 21.9% higher than the baseline Diffusion Policy. We also evaluate the method on a real-world system to show that effective policies can be learned with relatively few training samples, whereas the baseline Diffusion Policy cannot.
Submitted 1 July, 2024;
originally announced July 2024.
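The equivariance property this abstract relies on can be stated as $f(g \cdot x) = g \cdot f(x)$ for every rotation $g \in \mathrm{SO}(2)$. The sketch below is a toy numerical check of that identity for functions mapping a planar state to a planar action. It is only an illustration of the definition, not the paper's model or its 6-DoF analysis; the names `rotate`, `is_so2_equivariant`, `radial_policy`, and `biased_policy` are hypothetical.

```python
import math


def rotate(v, theta):
    """Rotate a 2D vector v counter-clockwise by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def is_so2_equivariant(f, points, angles, tol=1e-9):
    """Check f(R x) == R f(x) on the given sample points and angles.

    Passing on a finite sample is evidence, not a proof, of equivariance.
    """
    for x in points:
        for theta in angles:
            lhs = f(rotate(x, theta))
            rhs = rotate(f(x), theta)
            if math.dist(lhs, rhs) > tol:
                return False
    return True


def radial_policy(x):
    """Toy policy: move toward the origin, scaled by distance (equivariant)."""
    scale = 1.0 / (1.0 + math.hypot(x[0], x[1]))
    return (-scale * x[0], -scale * x[1])


def biased_policy(x):
    """Toy policy with a fixed offset along the first axis (not equivariant)."""
    return (x[0] + 1.0, x[1])


points = [(1.0, 0.0), (0.5, -2.0), (-3.0, 4.0)]
angles = [0.3, math.pi / 2, 2.5]
print(is_so2_equivariant(radial_policy, points, angles))
print(is_so2_equivariant(biased_policy, points, angles))
```

The radial policy depends on the input only through its norm and direction, so rotating the input rotates the output; the biased policy singles out one axis, which breaks the symmetry.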
-
An Intelligent Robotic System for Perceptive Pancake Batter Stirring and Precise Pouring
Authors:
Xinyuan Luo,
Shengmiao Jin,
Hung-Jui Huang,
Wenzhen Yuan
Abstract:
Cooking robots have long been desired by the commercial market, but the technical challenges remain significant. A major difficulty comes from the need to perceive and handle liquids with different properties. This paper presents a robot system that mixes batter and makes pancakes out of it, where understanding and handling the viscous liquid is an essential component. The system integrates haptic sensing and control algorithms to autonomously stir flour and water to achieve the desired batter uniformity, estimate the batter's properties such as the water-flour ratio and liquid level, and perform precise manipulations to pour the batter into any specified shape. Experimental results show the system's capability to always produce batter of desired uniformity, estimate water-flour ratio and liquid level precisely, and accurately pour it into complex shapes. This research showcases the potential for robots to assist in kitchens and is a step towards commercial culinary automation.
Submitted 1 July, 2024;
originally announced July 2024.
-
Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining
Authors:
Qi Zhang,
Tianqi Du,
Haotian Huang,
Yifei Wang,
Yisen Wang
Abstract:
In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. Through establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL within the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections compared to the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To leverage each other's strengths and mitigate weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at https://github.com/PKU-ML/LookAheadLookAround.
Submitted 30 June, 2024;
originally announced July 2024.
-
DCI: An Accurate Quality Assessment Criteria for Protein Complex Structure Models
Authors:
Wenda Wang,
Jiaqi Zhai,
He Huang,
Xinqi Gong
Abstract:
The structure of proteins is the basis for studying protein function and drug design. The emergence of AlphaFold 2 has greatly promoted the prediction of protein 3D structures, and it is of great significance to give an overall and accurate evaluation of the predicted models, especially the complex models. Among the existing methods for evaluating multimer structures, DockQ is the most commonly used. However, as a metric better suited to complex docking, DockQ cannot provide a unique and accurate evaluation in the non-docking situation. Therefore, it is necessary to propose an evaluation strategy that can directly evaluate the whole complex without limitation and achieve good results. In this work, we propose the DCI score, a new evaluation strategy for protein complex structure models that is based only on the distance map and the CI (contact-interface) map. DCI focuses on the prediction accuracy of the contact interface based on the overall evaluation of the complex structure, is not inferior to DockQ in evaluation accuracy according to the CAPRI classification, and handles the non-docking situation better than DockQ. Besides, we calculated the DCI score on CASP datasets and compared it with the CASP official assessment, obtaining good results. In addition, we found that DCI can better evaluate the overall structure deviation caused by interface prediction errors in the case of multi-chains. Our DCI is available at https://gitee.com/WendaWang/DCI-score.git, and the online server is available at http://mialab.ruc.edu.cn/DCIServer/.
Submitted 29 June, 2024;
originally announced July 2024.
-
A Survey on Failure Analysis and Fault Injection in AI Systems
Authors:
Guangba Yu,
Gou Tan,
Haojia Huang,
Zhenyu Zhang,
Pengfei Chen,
Roberto Natella,
Zibin Zheng
Abstract:
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, a comprehensive review of FA and FI methodologies in AI systems is lacking. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions: (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, and (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state of the art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems.
Submitted 27 June, 2024;
originally announced July 2024.
-
BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
Authors:
Zhehuai Chen,
He Huang,
Oleksii Hrinchuk,
Krishna C. Puvvada,
Nithin Rao Koluguri,
Piotr Żelasko,
Jagadeesh Balam,
Boris Ginsburg
Abstract:
Incorporating speech understanding capabilities into pretrained large-language models has become a vital research direction (SpeechLLM). The previous architectures can be categorized as: i) GPT-style, which prepends speech prompts to the text prompts as a sequence of LLM inputs like a decoder-only model; ii) T5-style, which introduces speech cross-attention to each layer of the pretrained LLMs. We propose the BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, there is no clear streaming solution for either style, especially considering the solution should generalize to speech multitask. We reformulate streamable SpeechLLM as a read-write policy problem and unify the offline and streaming research with the BESTOW architecture. Hence we demonstrate the first open-source SpeechLLM solution that enables Streaming and Multitask at scale (beyond ASR) at the same time. This streamable solution achieves very strong performance on a wide range of speech tasks (ASR, AST, SQA, unseen DynamicSuperb). It is end-to-end optimizable, with lower training/inference cost, and demonstrates LLM knowledge transferability to speech.
Submitted 28 June, 2024;
originally announced June 2024.
-
Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
Authors:
Krishna C. Puvvada,
Piotr Żelasko,
He Huang,
Oleksii Hrinchuk,
Nithin Rao Koluguri,
Kunal Dhawan,
Somshubra Majumdar,
Elena Rastorgueva,
Zhehuai Chen,
Vitaly Lavrukhin,
Jagadeesh Balam,
Boris Ginsburg
Abstract:
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms current state-of-the-art models (Whisper, OWSM, and Seamless-M4T) on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture, (2) training on synthetic data generated with machine translation, and (3) advanced training techniques: data-balancing, dynamic data blending, dynamic bucketing and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.
Submitted 28 June, 2024;
originally announced June 2024.
-
JuliVQC: an Efficient Variational Quantum Circuit Simulator for Near-Term Quantum Algorithms
Authors:
Wei-You Liao,
Xiang Wang,
Xiao-Yue Xu,
Chen Ding,
Shuo Zhang,
He-Liang Huang,
Chu Guo
Abstract:
We introduce JuliVQC: a light-weight, yet extremely efficient variational quantum circuit simulator. JuliVQC is part of an effort for classical simulation of the Zuchongzhi quantum processors, where it is extensively used to characterize the circuit noises, as a building block in the Schrödinger-Feynman algorithm for classical verification and performance benchmarking, and for variational optimization of the Fsim gate parameters. The design principle of JuliVQC is three-fold: (1) Transparent implementation of its core algorithms, realized by using the high-performance script language Julia; (2) Efficiency is the focus, with a cache-friendly implementation of each elementary operation and support for shared-memory parallelization; (3) Native support of automatic differentiation for both the noiseless and noisy quantum circuits. We perform extensive numerical experiments on JuliVQC in different application scenarios, including quantum circuits, variational quantum circuits and their noisy counterparts, which show that its performance is among the top of the popular alternatives.
Submitted 27 June, 2024;
originally announced June 2024.
-
DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
Authors:
Ke-Han Lu,
Zhehuai Chen,
Szu-Wei Fu,
He Huang,
Boris Ginsburg,
Yu-Chiang Frank Wang,
Hung-yi Lee
Abstract:
Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby facilitating the capability to understand both linguistic and non-linguistic features in speech. Enhanced with the proposed approach, our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. Moreover, we discover that the aligned model exhibits a zero-shot instruction-following capability without explicit speech instruction tuning. These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.
Submitted 26 June, 2024;
originally announced June 2024.
-
Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning
Authors:
Zhijie Nie,
Richong Zhang,
Zhangchi Feng,
Hailang Huang,
Xudong Liu
Abstract:
Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search, which aims to break the barriers between modality and language simultaneously and achieve image-text retrieval in the multi-lingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; particularly, the methods based on contrastive learning on large-scale data have significantly improved retrieval tasks. However, these methods directly follow the existing pre-training methods in the cross-lingual or cross-modal domain, leading to two problems of inconsistency in CCR. First, the methods with cross-lingual style suffer from intra-modal error propagation, resulting in inconsistent recall performance across languages in the whole dataset. Second, the methods with cross-modal style suffer from inter-modal optimization direction bias, resulting in inconsistent rank across languages within each instance, which cannot be reflected by Recall@K. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method, which treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-trained data, achieving a new state of the art.
Submitted 26 June, 2024;
originally announced June 2024.
-
A Refer-and-Ground Multimodal Large Language Model for Biomedicine
Authors:
Xiaoshuang Huang,
Haifeng Huang,
Lingdong Shen,
Yehui Yang,
Fangxin Shang,
Junwei Liu,
Jia Liu
Abstract:
With the rapid development of multimodal large language models (MLLMs), especially their capabilities in visual chat through refer and ground functionalities, their significance is increasingly recognized. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer and ground dataset for biomedical images. To address this challenge, we devised the Med-GRIT-270k dataset. It comprises 270k question-and-answer pairs and spans eight distinct medical imaging modalities. Most importantly, it is the first dataset dedicated to the biomedical domain that integrates refer and ground conversations. The key idea is to sample large-scale biomedical image-mask pairs from medical segmentation datasets and generate instruction datasets from text using ChatGPT. Additionally, we introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD) by using this dataset and multi-task instruction learning. Extensive experiments have corroborated the efficacy of the Med-GRIT-270k dataset and the multi-modal, fine-grained interactive capabilities of the BiRD model. This holds significant reference value for the exploration and development of intelligent biomedical assistants.
Submitted 28 June, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
A Cross Spatio-Temporal Pathology-based Lung Nodule Dataset
Authors:
Muwei Jian,
Haoran Zhang,
Mingju Shao,
Hongyu Chen,
Huihui Huang,
Yanjie Zhong,
Changlei Zhang,
Bin Wang,
Penghui Gao
Abstract:
Intelligent analysis of lung nodules with the assistance of computer-aided detection (CAD) techniques can improve the accuracy of lung cancer diagnosis. However, existing CAD systems and pulmonary datasets mainly focus on Computed Tomography (CT) images from one single period, while ignoring the cross spatio-temporal features associated with the progression of nodules contained in imaging data captured at various periods of lung cancer. If the evolution patterns of nodules across various periods in the patients' CT sequences can be explored, this will play a crucial role in guiding the precise screening and identification of lung cancer. Therefore, a cross spatio-temporal lung nodule dataset with pathological information for nodule identification and diagnosis is constructed, which contains 328 CT sequences and 362 annotated nodules from 109 patients. This comprehensive database is intended to drive research in the field of CAD towards more practical and robust methods, and also to contribute to the further exploration of fields related to precision medicine. To ensure patient confidentiality, we have removed sensitive information from the dataset.
Submitted 25 June, 2024;
originally announced June 2024.
-
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Authors:
Xiangyu Zhao,
Xiangtai Li,
Haodong Duan,
Haian Huang,
Yining Li,
Kai Chen,
Hua Yang
Abstract:
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
Submitted 26 June, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
Authors:
Cunchen Hu,
Heyang Huang,
Junhao Hu,
Jiang Xu,
Xusheng Chen,
Tao Xie,
Chenxi Wang,
Sa Wang,
Yungang Bao,
Ninghui Sun,
Yizhou Shan
Abstract:
Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token.
Submitted 26 June, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling
Authors:
Minghui Fang,
Shengpeng Ji,
Jialong Zuo,
Hai Huang,
Yan Xia,
Jieming Zhu,
Xize Cheng,
Xiaoda Yang,
Wenrui Liu,
Gang Wang,
Zhenhua Dong,
Zhou Zhao
Abstract:
Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a sequence-to-sequence model to directly generate candidate identifiers based on natural language queries. Without explicitly computing the similarity between queries and candidates, generative retrieval surpasses dual-tower models in both speed and accuracy on large-scale corpora, providing new insights for cross-modal retrieval. However, constructing identifiers for multimodal data remains an unexplored problem, and the modality gap between natural language queries and multimodal candidates hinders retrieval performance due to the absence of additional encoders. To this end, we propose a pioneering generAtive Cross-modal rEtrieval framework (ACE), which is a comprehensive framework for end-to-end cross-modal retrieval based on coarse-to-fine semantic modeling. We propose combining K-Means and RQ-VAE to construct coarse and fine tokens, serving as identifiers for multimodal data. Correspondingly, we design a coarse-to-fine feature fusion strategy to efficiently align natural language queries and candidate identifiers. ACE is the first work to comprehensively demonstrate the feasibility of the generative approach on text-to-image/audio/video retrieval, challenging the dominance of the embedding-based dual-tower architecture. Extensive experiments show that ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
Submitted 25 June, 2024;
originally announced June 2024.
-
Gigantic-oxidative atomically layered epitaxy for designed complex oxides
Authors:
Guangdi Zhou,
Haoliang Huang,
Fengzhe Wang,
Heng Wang,
Qishuo Yang,
Zihao Nie,
Wei Lv,
Cui Ding,
Yueying Li,
Danfeng Li,
Yujie Sun,
Junhao Lin,
Guang-Ming Zhang,
Qi-Kun Xue,
Zhuoyu Chen
Abstract:
In designing material functionality within the intricate realm of transition metal oxides, lattice structure and d-orbital occupancy are two principal determinants of the correlated physical properties, such as superconductivity. However, the modulation of these two factors is inherently limited by the need to balance thermodynamic stability, kinetic mobility, and synthesis precision, particularly for oxidation-demanding phases. We introduce a methodology, namely the gigantic-oxidative atomically layered epitaxy (GOAL-Epitaxy), enhancing oxidation power 3-4 orders of magnitude beyond oxide molecular beam epitaxy (OMBE) and pulsed laser deposition (PLD), while ensuring atomic-layer-by-layer growth of designed complex structures. Consequently, thermodynamic stability is markedly augmented at elevated temperatures, improving growth kinetics. We demonstrate the accurate synthesis of complex nickelates and cuprates, especially an artificially designed structure as a parent of high-temperature superconductivity, in which alternating single and double NiO2 layers possess distinct nominal d-orbital occupancy. The GOAL-Epitaxy enables material discovery within the vastly broadened growth parameter space.
Submitted 24 June, 2024;
originally announced June 2024.
-
MLPHand: Real Time Multi-View 3D Hand Mesh Reconstruction via MLP Modeling
Authors:
Jian Yang,
Jiakun Li,
Guoming Li,
Zhen Shen,
Huai-Yu Wu,
Zhaoxin Fan,
Heng Huang
Abstract:
Multi-view hand mesh reconstruction is a critical task for applications in virtual reality and human-computer interaction, but it remains a formidable challenge. Although existing multi-view hand reconstruction methods achieve remarkable accuracy, they typically come with an intensive computational burden that hinders real-time inference. To this end, we propose MLPHand, a novel method designed for real-time multi-view single hand reconstruction. MLPHand consists of two primary modules: (1) a lightweight MLP-based Skeleton2Mesh model that efficiently recovers hand meshes from hand skeletons, and (2) a multi-view geometry feature fusion prediction module that enhances the Skeleton2Mesh model with detailed geometric information from multiple views. Experiments on three widely used datasets demonstrate that MLPHand can reduce computational complexity by 90% while achieving reconstruction accuracy comparable to existing state-of-the-art baselines.
Submitted 23 June, 2024;
originally announced June 2024.
-
Open-vocabulary Pick and Place via Patch-level Semantic Maps
Authors:
Mingxi Jia,
Haojie Huang,
Zhewen Zhang,
Chenghao Wang,
Linfeng Zhao,
Dian Wang,
Jason Xinyu Liu,
Robin Walters,
Robert Platt,
Stefanie Tellex
Abstract:
Controlling robots through natural language instructions in open-vocabulary scenarios is pivotal for enhancing human-robot collaboration and complex robot behavior synthesis. However, achieving this capability poses significant challenges due to the need for a system that can generalize from limited data to a wide range of tasks and environments. Existing methods rely on large, costly datasets and struggle with generalization. This paper introduces Grounded Equivariant Manipulation (GEM), a novel approach that leverages the generative capabilities of pre-trained vision-language models and geometric symmetries to facilitate few-shot and zero-shot learning for open-vocabulary robot manipulation tasks. Our experiments demonstrate GEM's high sample efficiency and superior generalization across diverse pick-and-place tasks in both simulation and real-world experiments, showcasing its ability to adapt to novel instructions and unseen objects with minimal data requirements. GEM represents a significant step forward in the domain of language-conditioned robot control, bridging the gap between semantic understanding and action generation in robotic systems.
Submitted 21 June, 2024;
originally announced June 2024.