HGE: Embedding Temporal Knowledge Graphs in a Product Space of Heterogeneous Geometric Subspaces

Authors: Jiaxin Pan, Mojtaba Nayyeri, Yinan Li, Steffen Staab

Abstract: Temporal knowledge graphs represent temporal facts $(s,p,o,τ)$ relating a subject $s$ and an object $o$ via a relation label $p$ at time $τ$, where $τ$ could be a time point or time interval. Temporal knowledge graphs may exhibit static temporal patterns at distinct points in time and dynamic temporal patterns between different timestamps. In order to learn a rich set of static and dynamic temporal patterns and apply them for inference, several embedding approaches have been suggested in the literature. However, as most of them resort to single underlying embedding spaces, their capability to model all kinds of temporal patterns is severely limited by having to adhere to the geometric property of their single embedding space. We lift this limitation by an embedding approach that maps temporal facts into a product space of several heterogeneous geometric subspaces with distinct geometric properties, i.e., Complex, Dual, and Split-complex spaces. In addition, we propose a temporal-geometric attention mechanism to integrate information from different geometric subspaces conveniently according to the captured relational and temporal information. Experimental results on standard temporal benchmark datasets favorably evaluate our approach against state-of-the-art models.
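
The abstract names the Complex, Dual, and Split-complex subspaces but gives no formulas. A minimal sketch of why their geometries differ, assuming each subspace is the two-dimensional algebra a + b·u with u² = −1, 0, +1 respectively and that a relation acts on an entity by multiplication (a common convention in knowledge graph embeddings, not confirmed here as HGE's exact scoring function). A relation element whose quadratic form equals 1 then acts as a circular rotation, a shear, or a hyperbolic rotation. The names `Num2`, `UNIT_SQUARE`, and `unit_relation` are hypothetical, and the temporal-geometric attention mechanism is not modelled.

```python
import math
from dataclasses import dataclass

# Square of the non-real unit u in each two-dimensional algebra (assumed encoding).
UNIT_SQUARE = {"complex": -1.0, "dual": 0.0, "split": 1.0}


@dataclass(frozen=True)
class Num2:
    """A number a + b*u whose unit satisfies u*u == UNIT_SQUARE[kind]."""
    a: float
    b: float
    kind: str

    def __mul__(self, other: "Num2") -> "Num2":
        if self.kind != other.kind:
            raise ValueError("factors must come from the same subspace")
        u2 = UNIT_SQUARE[self.kind]
        # (a + b u)(c + d u) = (ac + bd u^2) + (ad + bc) u
        return Num2(self.a * other.a + u2 * self.b * other.b,
                    self.a * other.b + self.b * other.a,
                    self.kind)

    def quadratic_form(self) -> float:
        """a^2 - u^2 b^2: Euclidean norm^2 (complex), a^2 (dual), a^2 - b^2 (split)."""
        return self.a ** 2 - UNIT_SQUARE[self.kind] * self.b ** 2


def unit_relation(kind: str, t: float) -> Num2:
    """Hypothetical relation element with quadratic form 1: a circular rotation,
    a shear, or a hyperbolic rotation, depending on the subspace."""
    if kind == "complex":
        return Num2(math.cos(t), math.sin(t), kind)
    if kind == "dual":
        return Num2(1.0, t, kind)
    if kind == "split":
        return Num2(math.cosh(t), math.sinh(t), kind)
    raise ValueError(f"unknown subspace: {kind}")


# Usage: the same relation parameter applied to one entity coordinate per subspace.
for kind in ("complex", "dual", "split"):
    entity = Num2(3.0, 1.0, kind)
    moved = unit_relation(kind, 0.7) * entity
    print(kind, moved, round(moved.quadratic_form(), 6))
```

Each multiplication preserves its own subspace's quadratic form, so the three subspaces can encode different kinds of relational patterns, which is the motivation the abstract gives for a product space.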

Submitted 25 December, 2023; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: The 38th Annual AAAI Conference on Artificial Intelligence (AAAI'24)

arXiv:2312.12030

Towards Accurate Guided Diffusion Sampling through Symplectic Adjoint Method

Authors: Jiachun Pan, Hanshu Yan, Jun Hao Liew, Jiashi Feng, Vincent Y. F. Tan

Abstract: Training-free guided sampling in diffusion models leverages off-the-shelf pre-trained networks, such as an aesthetic evaluation model, to guide the generation process. Current training-free guided sampling algorithms obtain the guidance energy function based on a one-step estimate of the clean image. However, since the off-the-shelf pre-trained networks are trained on clean images, the one-step estimation procedure of the clean image may be inaccurate, especially in the early stages of the generation process in diffusion models. This causes the guidance in the early time steps to be inaccurate. To overcome this problem, we propose Symplectic Adjoint Guidance (SAG), which calculates the gradient guidance in two inner stages. Firstly, SAG estimates the clean image via $n$ function calls, where $n$ serves as a flexible hyperparameter that can be tailored to meet specific image quality requirements. Secondly, SAG uses the symplectic adjoint method to obtain the gradients accurately and with low memory requirements. Extensive experiments demonstrate that SAG generates images of higher quality than the baselines in both guided image and video generation tasks.
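
The abstract refers to a one-step estimate of the clean image without stating it. Assuming the standard DDPM forward process (noisy sample $x_t$, cumulative noise schedule $\bar\alpha_t$, noise-prediction network $\epsilon_\theta$, guidance energy $E$; none of these symbols appear in the abstract), the one-step estimate and the guidance gradient taken through it are usually written as:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \hat{x}_0(x_t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}, \qquad g_t = \nabla_{x_t} E\big(\hat{x}_0(x_t)\big).$$

Substituting the first identity into the second gives $\hat{x}_0 - x_0 = \sqrt{(1-\bar\alpha_t)/\bar\alpha_t}\,\big(\epsilon - \epsilon_\theta(x_t, t)\big)$, so any noise-prediction error is amplified when $\bar\alpha_t$ is small, i.e., in the early, high-noise steps. This is one way to read the abstract's observation that the one-step estimate is least reliable early in sampling, which SAG addresses by spending $n$ function calls on the estimate.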

Submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.10997

Retrieval-Augmented Generation for Large Language Models: A Survey

Authors: Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang

Abstract: Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This enhances the accuracy and credibility of the generation, particularly for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domain-specific information. RAG synergistically merges LLMs' intrinsic knowledge with the vast, dynamic repositories of external databases. This comprehensive review paper offers a detailed examination of the progression of RAG paradigms, encompassing the Naive RAG, the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the tripartite foundation of RAG frameworks, which includes the retrieval, the generation and the augmentation techniques. The paper highlights the state-of-the-art technologies embedded in each of these critical components, providing a profound understanding of the advancements in RAG systems. Furthermore, this paper introduces an up-to-date evaluation framework and benchmark. At the end, this article delineates the challenges currently faced and points out prospective avenues for research and development.
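
Naive RAG, the simplest paradigm the survey covers, is commonly described as index, retrieve, then generate. A minimal sketch under stated assumptions: a toy bag-of-words cosine retriever stands in for a dense embedding model, and the assembled prompt would be passed to an LLM, which is outside the scope of the sketch. The functions `tokenize`, `cosine`, `retrieve`, and `build_prompt` are hypothetical names, not an API from the survey.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (stable order on ties)."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Augment the query with retrieved context; the result would be sent to an LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {query}\nAnswer:"


# Usage with a three-document toy corpus.
corpus = [
    "The Eiffel Tower is in Paris.",
    "Bananas are rich in potassium.",
    "Paris is the capital of France.",
]
print(build_prompt("Where is the Eiffel Tower?", corpus, k=1))
```

Advanced and Modular RAG, as the survey frames them, add stages around this loop (for example query rewriting or reranking); the sketch covers only the baseline.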

Submitted 27 March, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: Ongoing Work

arXiv:2312.08994

PANDA: Architecture-Level Power Evaluation by Unifying Analytical and Machine Learning Solutions

Authors: Qijun Zhang, Shiyu Li, Guanglei Zhou, Jingyu Pan, Chen-Chia Chang, Yiran Chen, Zhiyao Xie

Abstract: Power efficiency is a critical design objective in modern microprocessor design. To evaluate the impact of architectural-level design decisions, an accurate yet efficient architecture-level power model is desired. However, widely adopted data-independent analytical power models like McPAT and Wattch have been criticized for their unreliable accuracy. While some machine learning (ML) methods have been proposed for architecture-level power modeling, they rely on sufficient known designs for training and perform poorly when the number of available designs is limited, which is typically the case in realistic scenarios. In this work, we derive a general formulation that unifies existing architecture-level power models. Based on the formulation, we propose PANDA, an innovative architecture-level solution that combines the advantages of analytical and ML power models. It achieves unprecedentedly high accuracy on unknown new designs even when there are very limited designs for training, which is a common challenge in practice. Besides being an excellent power model, it can predict area, performance, and energy accurately. PANDA further supports power prediction for unknown new technology nodes. In our experiments, besides validating the superior performance and the wide range of functionalities of PANDA, we also propose an application scenario, where PANDA is shown to identify high-performance design configurations given a power constraint.

Submitted 14 December, 2023; originally announced December 2023.

Journal ref: IEEE/ACM International Conference on Computer-Aided Design (ICCAD) 2023

arXiv:2312.07839

Minimax-optimal estimation for sparse multi-reference alignment with collision-free signals

Authors: Subhro Ghosh, Soumendu Sundar Mukherjee, Jing Bin Pan

Abstract: The Multi-Reference Alignment (MRA) problem aims at the recovery of an unknown signal from repeated observations under the latent action of a group of cyclic isometries, in the presence of additive noise of high intensity $σ$. It is a more tractable version of the celebrated cryo-EM model. In the crucial high noise regime, it is known that its sample complexity scales as $σ^6$. Recent investigations have shown that for the practically significant setting of sparse signals, the sample complexity of the maximum likelihood estimator asymptotically scales with the noise level as $σ^4$. In this work, we investigate minimax optimality for signal estimation under the MRA model for so-called collision-free signals. In particular, this signal class covers the setting of generic signals of dilute sparsity (wherein the support size $s=O(L^{1/3})$, where $L$ is the ambient dimension). We demonstrate that the minimax optimal rate of estimation for the sparse MRA problem in this setting is $σ^2/\sqrt{n}$, where $n$ is the sample size. In particular, this widely generalizes the sample complexity asymptotics for the restricted MLE in this setting, establishing it as the statistically optimal estimator. Finally, we demonstrate a concentration inequality for the restricted MLE on its deviations from the ground truth.
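
To make the observation model concrete, here is a minimal sketch assuming the standard MRA setup: each observation is $y = R_\ell x + σ\xi$, with $R_\ell$ a cyclic shift by an index $\ell$ drawn uniformly from $\{0, \dots, L-1\}$ and $\xi$ i.i.d. standard Gaussian noise. It simulates data and checks that the cyclic autocorrelation is invariant to the unknown shift, which is why shift-invariant moments are a natural starting point. This illustrates the data model only; it is not the restricted MLE or the minimax analysis of the paper, and the function names are hypothetical.

```python
import random


def cyclic_shift(x, shift):
    """(R_l x)[i] = x[(i - l) mod L]: cyclically shift x by `shift` positions."""
    L = len(x)
    return [x[(i - shift) % L] for i in range(L)]


def mra_observations(x, n, sigma, seed=0):
    """Draw n observations y = R_l x + sigma * xi, with l uniform on {0, ..., L-1}
    and xi standard Gaussian noise (assumed model)."""
    rng = random.Random(seed)
    L = len(x)
    observations = []
    for _ in range(n):
        shift = rng.randrange(L)
        observations.append([v + sigma * rng.gauss(0.0, 1.0) for v in cyclic_shift(x, shift)])
    return observations


def autocorrelation(y):
    """Cyclic autocorrelation a[k] = sum_i y[i] * y[(i + k) mod L]; unchanged by cyclic shifts."""
    L = len(y)
    return [sum(y[i] * y[(i + k) % L] for i in range(L)) for k in range(L)]


# Usage: a sparse signal (support size 2) in ambient dimension L = 8.
x = [0.0, 3.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
ys = mra_observations(x, n=1000, sigma=2.0)
avg = [sum(col) / len(ys) for col in zip(*(autocorrelation(y) for y in ys))]
print(autocorrelation(x))
# In expectation, lag 0 exceeds the noiseless value by L * sigma**2; other lags are unbiased.
print([round(v, 1) for v in avg])
```

For a sparse signal, the nonzero lags of the autocorrelation reflect pairwise differences between support positions, which gives some intuition for why the support structure matters in the sparse regime.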

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2312.05835

doi 10.3847/1538-4365/acfb03

Discovery and Timing of Millisecond Pulsars in the Globular Cluster M5 (NGC 5904) with FAST and Arecibo

Authors: Lei Zhang, Paulo C. C. Freire, Alessandro Ridolfi, Zhichen Pan, Jiaqi Zhao, Craig O. Heinke, Jianxing Chen, Mario Cadelano, Cristina Pallanca, Xian Hou, Xiaoting Fu, Shi Dai, Erbil Gugercinoglu, Meng Guo, Jason Hessels, Jiale Hu, Guodong Li, Mengmeng Ni, Jingshan Pan, Scott M. Ransom, Qitong Ruan, Ingrid Stairs, Chao-Wei Tsai, Pei Wang, Long Wang, et al. (7 additional authors not shown)

Abstract: We report on a comprehensive multi-wavelength study of the pulsars in the globular cluster (GC) M5, including the discovery of M5G, a new compact non-eclipsing "black widow" pulsar. Thanks to the analysis of 34 years of radio data taken with the FAST and Arecibo telescopes, we obtained new phase-connected timing solutions for four pulsars in the cluster and improved those of the other three known pulsars. These have resulted in, among other things: a) much improved proper motions for five pulsars, with transverse velocities that are smaller than their respective escape velocities; b) 3-sigma and 1.5-sigma detections of Shapiro delays in M5F and M5D, respectively; c) greatly improved measurement of the periastron advance in M5B, whose value of 0.01361(6) implies that M5B is still likely to be a heavy neutron star. The binary pulsars M5D, E and F are confirmed to be in low-eccentricity binary systems, the low-mass companions of which are newly identified to be He white dwarfs using Hubble Space Telescope data. Four pulsars are also found to be associated with X-ray sources. Similarly to the eclipsing pulsar M5C, M5G shows little or no non-thermal X-ray emission, indicative of weak synchrotron radiation produced by intra-binary shocks. All the seven pulsars known in M5 have short spin periods and five are in binary systems with low orbital eccentricities. These characteristics differ from the overall GC pulsar population, but confirm the expectations for the pulsar population in a cluster with a small rate of stellar encounters per binary system.

Submitted 10 December, 2023; originally announced December 2023.

Journal ref: ApJS, 2023, 269:56

arXiv:2312.05177

doi 10.1088/1475-7516/2024/06/051

Compressed baryon acoustic oscillation analysis is robust to modified-gravity models

Authors: Jiaming Pan, Dragan Huterer, Felipe Andrade-Oliveira, Camille Avestruz

Abstract: We study the robustness of the baryon acoustic oscillation (BAO) analysis to the underlying cosmological model. We focus on testing the standard BAO analysis that relies on the use of a template. These templates are constructed assuming a fixed fiducial cosmological model and used to extract the location of the acoustic peaks. Such "compressed analysis" had been shown to be unbiased when applied to the $Λ$CDM model and some of its extensions. However, it has not been known whether this type of analysis introduces biases in a wider range of cosmological models where the template may not fully capture relevant features in the BAO signal. In this study, we apply the compressed analysis to noiseless mock power spectra that are based on Horndeski models, a broad class of modified-gravity theories specified with eight additional free parameters. We study the precision and accuracy of the BAO peak-location extraction assuming DESI, DESI II, and MegaMapper survey specifications. We find that the bias in the extracted peak locations is negligible; for example, it is less than 10% of the statistical error for even the proposed future MegaMapper survey. Our findings indicate that the compressed BAO analysis is remarkably robust to the underlying cosmological model.

Submitted 26 June, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: 30 pages, 8 Figures, published version

Journal ref: Journal of Cosmology and Astroparticle Physics 06 (2024) 051

arXiv:2312.04965

Inversion-Free Image Editing with Natural Language

Authors: Sihan Xu, Yidong Huang, Jiayi Pan, Ziqiao Ma, Joyce Chai

Abstract: Despite recent advances in inversion-based editing, text-guided image manipulation remains challenging for diffusion models. The primary bottlenecks include 1) the time-consuming nature of the inversion process; 2) the struggle to balance consistency with accuracy; 3) the lack of compatibility with efficient consistency sampling methods used in consistency models. To address the above issues, we start by asking ourselves if the inversion process can be eliminated for editing. We show that when the initial sample is known, a special variance schedule reduces the denoising step to the same form as the multi-step consistency sampling. We name this Denoising Diffusion Consistent Model (DDCM), and note that it implies a virtual inversion strategy without explicit inversion in sampling. We further unify the attention control mechanisms in a tuning-free framework for text-guided editing. Combining them, we present inversion-free editing (InfEdit), which allows for consistent and faithful editing for both rigid and non-rigid semantic changes, catering to intricate modifications without compromising on the image's integrity and explicit inversion. Through extensive experiments, InfEdit shows strong performance in various editing tasks and also maintains a seamless workflow (less than 3 seconds on one single A40), demonstrating the potential for real-time applications. Project Page: https://sled-group.github.io/InfEdit/

Submitted 7 December, 2023; originally announced December 2023.

Comments: Project Page: https://sled-group.github.io/InfEdit/

arXiv:2312.03788

SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM

Authors: Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng

Abstract: Large language models (LLMs) have shown remarkable capabilities in various tasks. However, their huge model size and the consequent demand for computational and memory resources also pose challenges to model deployment. Currently, 4-bit post-training quantization (PTQ) has achieved some success in LLMs, reducing the memory footprint by approximately 75% compared to FP16 models, albeit with some accuracy loss. In this paper, we propose SmoothQuant+, an accurate and efficient 4-bit weight-only PTQ that requires no additional training and, for the first time, achieves lossless accuracy for LLMs. Based on the fact that the loss of weight quantization is amplified by the activation outliers, SmoothQuant+ smoothes the activation outliers by channel before quantization, while adjusting the corresponding weights for mathematical equivalence, and then performs group-wise 4-bit weight quantization for linear layers. We have integrated SmoothQuant+ into the vLLM framework, an advanced high-throughput inference engine specially developed for LLMs, and equipped it with efficient W4A16 CUDA kernels, so that vLLM can seamlessly support SmoothQuant+ 4-bit weight quantization. Our results show that, with SmoothQuant+, the Code Llama-34B model can be quantized and deployed on an A100 40GB GPU, achieving lossless accuracy and a throughput increase of 1.9 to 4.0 times compared to the FP16 model deployed on two A100 40GB GPUs. Moreover, the latency per token is only 68% of the FP16 model deployed on two A100 40GB GPUs. To the best of our knowledge, this is the state-of-the-art 4-bit weight quantization for LLMs.
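
The abstract describes three steps: per-channel smoothing of activation outliers, compensating the weights so the layer output is mathematically unchanged, and group-wise 4-bit weight quantization. A minimal plain-Python sketch of that arithmetic, under stated assumptions: the smoothing-factor formula $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ and the migration strength `alpha` are borrowed from the original SmoothQuant and may differ from SmoothQuant+; the quantizer is a symmetric int4 fake-quantizer chosen for illustration. The function names are hypothetical, and none of this reflects the vLLM W4A16 kernels.

```python
def matmul(A, B):
    """Plain-Python matrix product A @ B (A: tokens x in_channels, B: in_channels x out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smoothing_factors(X, W, alpha=0.5, eps=1e-5):
    """Per-input-channel factors s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha).
    Formula and alpha borrowed from the original SmoothQuant (assumption)."""
    act_max = [max(max(abs(v) for v in col), eps) for col in zip(*X)]
    w_max = [max(max(abs(v) for v in row), eps) for row in W]
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_max, w_max)]


def smooth(X, W, s):
    """X' = X diag(s)^-1 and W' = diag(s) W, so X' W' == X W up to rounding."""
    Xs = [[v / sj for v, sj in zip(row, s)] for row in X]
    Ws = [[v * sj for v in row] for row, sj in zip(W, s)]
    return Xs, Ws


def quantize_int4_groupwise(W, group_size):
    """Symmetric 4-bit fake quantization: each output column is split into groups of
    group_size input rows that share one scale. Returns dequantized weights."""
    n_in, n_out = len(W), len(W[0])
    out = [row[:] for row in W]
    for c in range(n_out):
        for start in range(0, n_in, group_size):
            rows = range(start, min(start + group_size, n_in))
            max_abs = max(abs(W[r][c]) for r in rows)
            scale = max_abs / 7 if max_abs > 0 else 1.0
            for r in rows:
                q = max(-8, min(7, round(W[r][c] / scale)))  # int4 range [-8, 7]
                out[r][c] = q * scale
    return out


# Usage: input channel 1 of X carries an activation outlier.
X = [[1.0, 40.0], [2.0, -30.0]]
W = [[0.5, -0.2], [0.1, 0.3]]
s = smoothing_factors(X, W)
Xs, Ws = smooth(X, W, s)
Wq = quantize_int4_groupwise(Ws, group_size=2)
print(matmul(X, W))    # original layer output
print(matmul(Xs, Ws))  # same output, smaller activation range
print(matmul(Xs, Wq))  # output after 4-bit weight quantization
```

The design point the sketch isolates is that smoothing moves magnitude from activations into weights without changing the product, which is what lets a weight-only quantizer see a better-conditioned problem.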

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2312.01837

Prompting Disentangled Embeddings for Knowledge Graph Completion with Pre-trained Language Model

Authors: Yuxia Geng, Jiaoyan Chen, Yuhang Zeng, Zhuo Chen, Wen Zhang, Jeff Z. Pan, Yuxiang Wang, Xiaoliang Xu

Abstract: Both graph structures and textual information play a critical role in Knowledge Graph Completion (KGC). With the success of Pre-trained Language Models (PLMs) such as BERT, they have been applied for text encoding for KGC. However, the current methods mostly prefer to fine-tune PLMs, leading to huge training costs and limited scalability to larger PLMs. In contrast, we propose to utilize prompts and perform KGC on a frozen PLM with only the prompts trained. Accordingly, we propose a new KGC method named PDKGC with two prompts -- a hard task prompt which is to adapt the KGC task to the PLM pre-training task of token prediction, and a disentangled structure prompt which learns disentangled graph representation so as to enable the PLM to combine more relevant structure knowledge with the text information. With the two prompts, PDKGC builds a textual predictor and a structural predictor, respectively, and their combination leads to more comprehensive entity prediction. Solid evaluation on two widely used KGC datasets has shown that PDKGC often outperforms the baselines including the state-of-the-art, and its components are all effective. Our code and data are available at https://github.com/genggengcss/PDKGC.

Submitted 4 December, 2023; originally announced December 2023.

Comments: under review

arXiv:2312.01677

Multi-task Image Restoration Guided By Robust DINO Features

Authors: Xin Lin, Chao Ren, Kelvin C. K. Chan, Lu Qi, Jinshan Pan, Ming-Hsuan Yang

Abstract: Multi-task image restoration has gained significant interest due to its inherent versatility and efficiency compared to its single-task counterpart. Despite its potential, performance degradation is observed with an increase in the number of tasks, primarily attributed to the distinct nature of each restoration task. Addressing this challenge, we introduce DINO-IR, a novel multi-task image restoration approach leveraging robust features extracted from DINOv2. Our empirical analysis shows that while shallow features of DINOv2 capture rich low-level image characteristics, the deep features ensure a robust semantic representation insensitive to degradations while preserving high-frequency contour details. Building on these features, we devise specialized components, including multi-layer semantic fusion module, DINO-Restore adaption and fusion module, and DINO perception contrastive loss, to integrate DINOv2 features into the restoration paradigm. Equipped with the aforementioned components, our DINO-IR performs favorably against existing multi-task image restoration approaches in various tasks by a large margin, indicating the superiority and necessity of reinforcing the robust features for multi-task image restoration.

Submitted 5 December, 2023; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: Some important information needs to be added

arXiv:2312.01674

EDALearn: A Comprehensive RTL-to-Signoff EDA Benchmark for Democratized and Reproducible ML for EDA Research

Authors: Jingyu Pan, Chen-Chia Chang, Zhiyao Xie, Yiran Chen

Abstract: The application of Machine Learning (ML) in Electronic Design Automation (EDA) for Very Large-Scale Integration (VLSI) design has garnered significant research attention. Despite the requirement for extensive datasets to build effective ML models, most studies are limited to smaller, internally generated datasets due to the lack of comprehensive public resources. In response, we introduce EDALearn, the first holistic, open-source benchmark suite specifically for ML tasks in EDA. This benchmark suite presents an end-to-end flow from synthesis to physical implementation, enriching data collection across various stages. It fosters reproducibility and promotes research into ML transferability across different technology nodes. Accommodating a wide range of VLSI design instances and sizes, our benchmark aptly represents the complexity of contemporary VLSI designs. Additionally, we provide an in-depth data analysis, enabling users to fully comprehend the attributes and distribution of our data, which is essential for creating efficient ML models. Our contributions aim to encourage further advances in the ML-EDA domain.

Submitted 4 December, 2023; originally announced December 2023.

Comments: 8 pages

arXiv:2311.17532

Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

Authors: Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, Yike Guo

Abstract: Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While the existing methods enable generating the gestures to follow a single emotion label, they overlook that long gesture sequence modeling with emotion transition is more practical in real scenes. In addition, the lack of large-scale available datasets with emotional transition speech and corresponding 3D human gestures also hinders progress on this task. To fulfill this goal, we first incorporate ChatGPT-4 and an audio inpainting approach to construct high-fidelity emotion transition human speech. Since obtaining realistic 3D pose annotations corresponding to the dynamically inpainted emotion transition audio is extremely difficult, we propose a novel weakly supervised training strategy to encourage authority gesture transitions. Specifically, to enhance the coordination of transition gestures w.r.t. different emotional ones, we model the temporal association representation between two different emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision based on a learnable mixed emotion label for transition gestures. Last, we present a keyframe sampler to supply effective initial posture cues in long sequences, enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts on our newly defined emotion transition task and datasets. Our code and dataset will be released on the project page: https://xingqunqi-lab.github.io/Emo-Transition-Gesture/.

Submitted 27 March, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: Accepted by CVPR 2024

arXiv:2311.17455

Experimental Generation of Spin-Photon Entanglement in Silicon Carbide

Authors: Ren-Zhou Fang, Xiao-Yi Lai, Tao Li, Ren-Zhu Su, Bo-Wei Lu, Chao-Wei Yang, Run-Ze Liu, Yu-Kun Qiao, Cheng Li, Zhi-Gang He, Jia Huang, Hao Li, Li-Xing You, Yong-Heng Huo, Xiao-Hui Bao, Jian-Wei Pan

Abstract: A solid-state approach for quantum networks is advantageous, as it allows the integration of nanophotonics to enhance the photon emission and the utilization of weakly coupled nuclear spins for long-lived storage. Silicon carbide, specifically point defects within it, shows great promise in this regard due to its ease of availability and well-established nanofabrication techniques. Despite remarkable progress, achieving spin-photon entanglement remains a crucial aspect to be realized. In this paper, we experimentally generate entanglement between a silicon vacancy defect in silicon carbide and a scattered single photon in the zero-phonon line. The spin state is measured by detecting photons scattered in the phonon sideband. The photonic qubit is encoded in the time-bin degree-of-freedom and measured using an unbalanced Mach-Zehnder interferometer. Photonic correlations not only reveal the quality of the entanglement but also verify the deterministic nature of the entanglement creation process. By harnessing two pairs of such spin-photon entanglement, it becomes straightforward to entangle remote quantum nodes at long distance.

Submitted 29 November, 2023; originally announced November 2023.

Comments: 8 pages in total, 4 figures in the main text, 1 figure in the supplemental material

arXiv:2311.17366

Generative Hierarchical Temporal Transformer for Hand Action Recognition and Motion Prediction

Authors: Yilin Wen, Hao Pan, Takehiko Ohkawa, Lei Yang, Jia Pan, Yoichi Sato, Taku Komura, Wenping Wang

Abstract: We present a novel framework that concurrently tackles hand action recognition and 3D future hand motion prediction. While previous works focus on either recognition or prediction, we propose a generative Transformer VAE architecture to jointly capture both aspects, facilitating realistic motion prediction by leveraging the short-term hand motion and long-term action consistency observed across timestamps. To ensure faithful representation of the semantic dependency and different temporal granularity of hand pose and action, our framework is decomposed into two cascaded VAE blocks. The lower pose block models short-span poses, while the upper action block models long-span action. These are connected by a mid-level feature that represents sub-second series of hand poses. Our framework is trained across multiple datasets, where pose and action blocks are trained separately to fully utilize pose-action annotations of different qualities. Evaluations show that on multiple datasets, the joint modeling of recognition and prediction improves over separate solutions, and the semantic and temporal hierarchy enables long-term pose and action modeling.

Submitted 24 December, 2023; v1 submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.15309

Deep Refinement-Based Joint Source Channel Coding over Time-Varying Channels

Authors: Junyu Pan, Hanlei Li, Guangyi Zhang, Yunlong Cai, Guanding Yu

Abstract: Deep learning (DL)-based joint source-channel coding (JSCC) for wireless image transmission has recently made significant strides in performance. Nonetheless, most existing DL-based JSCC methods are tailored to stable channel conditions, notably a fixed signal-to-noise ratio (SNR). This specialization is limiting: their performance tends to degrade in practical scenarios with highly dynamic channels, since a fixed SNR cannot represent how such channels vary. In response, we introduce deep refinement-based JSCC (DRJSCC), a method designed to adapt to channels exhibiting temporal variations. By leveraging instantaneous channel state information (CSI), we dynamically optimize the encoding strategy by re-encoding the channel symbols, so that the encoding strategy stays aligned with the varying channel conditions during transmission. Specifically, our approach first divides the encoded symbols into multiple blocks, which are transmitted progressively to the receiver. When channel conditions change, we propose a mechanism to re-encode the remaining blocks so that they adapt to the current channel conditions. Experimental results show that the DRJSCC scheme achieves performance comparable to other mainstream DL-based JSCC models under stable channel conditions, and also exhibits strong robustness against time-varying channels.
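To make the progressive, block-wise transmission concrete, here is a toy control loop in Python. It is a sketch under stated assumptions, not the DRJSCC model: `toy_encoder` is a hypothetical stand-in for the learned encoder, the SNR trace plays the role of instantaneous CSI, and the 2 dB re-encoding threshold is invented for illustration. Only the structure follows the abstract: blocks go out one at a time, and when the channel changes, only the not-yet-sent remainder is re-encoded.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Block:
    snr_db: float         # SNR (dB) the block was encoded for
    symbols: List[float]  # toy "channel symbols"


def toy_encoder(source: Sequence[float], snr_db: float, block_size: int) -> List[Block]:
    """Hypothetical stand-in for a learned JSCC encoder.

    Low SNR doubles the amplitude, a crude proxy for a more robust code.
    """
    scale = 2.0 if snr_db < 5.0 else 1.0
    return [
        Block(snr_db, [scale * x for x in source[i:i + block_size]])
        for i in range(0, len(source), block_size)
    ]


def transmit_progressively(
    source: Sequence[float],
    snr_trace: Sequence[float],
    block_size: int = 4,
    threshold_db: float = 2.0,
) -> Tuple[List[Block], int]:
    """Send one block per time step; re-encode the unsent remainder if the SNR drifts."""
    enc_snr = snr_trace[0]
    pending = toy_encoder(source, enc_snr, block_size)
    sent: List[Block] = []
    consumed = 0   # source samples already covered by transmitted blocks
    reencodes = 0
    t = 0
    while pending:
        snr_now = snr_trace[min(t, len(snr_trace) - 1)]  # instantaneous CSI at step t
        if abs(snr_now - enc_snr) > threshold_db:
            # Channel changed: re-encode only the part of the source not yet sent.
            enc_snr = snr_now
            pending = toy_encoder(source[consumed:], enc_snr, block_size)
            reencodes += 1
        block = pending.pop(0)
        sent.append(block)
        consumed += len(block.symbols)
        t += 1
    return sent, reencodes


sent, n = transmit_progressively([float(x) for x in range(12)], [10.0, 10.0, 3.0, 3.0])
print([b.snr_db for b in sent], n)  # [10.0, 10.0, 3.0] 1
```

In this run the first two blocks are sent at the initial 10 dB encoding; when the trace drops to 3 dB, the last four source samples are re-encoded before the third block is sent.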

Submitted 26 November, 2023; originally announced November 2023.

arXiv:2311.11866 [pdf, other]

Analyzing Emissions and Energy Efficiency at Unsignalized Real-world Intersections Under Mixed Traffic Control

Authors: Michael Villarreal, Dawei Wang, Jia Pan, Weizi Li

Abstract: Greenhouse gas emissions have risen dramatically since the early 1900s, with U.S. transportation generating 28% of U.S. emissions. As such, there is interest in reducing transportation-related emissions. Specifically, sustainability research has sprouted around signalized intersections, as intersections allow different streams of traffic to cross and change directions. Recent research has developed mixed traffic control eco-driving strategies at signalized intersections to decrease emissions. However, the inherent structure of a signalized intersection increases emissions by creating frequent acceleration/deceleration events, excessive idling from traffic congestion, and stop-and-go waves. Thus, we believe unsignalized intersections hold potential for further sustainability improvements. In this work, we provide an emissions analysis of unsignalized intersections with complex, real-world topologies and traffic demands, where mixed traffic control strategies are employed by robot vehicles (RVs) to reduce wait times and congestion. We find that with an RV penetration rate of at least 10%, RVs reduce fuel consumption, CO2 emissions, and NOx emissions relative to signalized intersections by up to 27%, 27%, and 28%, respectively. With at least 30% RVs, CO and HC emissions are reduced by up to 42% and 43%, respectively. Additionally, RVs can reduce network-wide emissions despite only employing their strategies at intersections.

Submitted 17 January, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

Comments: Accepted to 4th IEEE Forum for Innovative Sustainable Transportation Systems

arXiv:2311.11551 [pdf, other]

Adapt in Contexts: Retrieval-Augmented Domain Adaptation via In-Context Learning

Authors: Quanyu Long, Wenya Wang, Sinno Jialin Pan

Abstract: Large language models (LLMs) have showcased their few-shot inference capability, known as in-context learning (ICL). However, in-domain demonstrations are not always readily available in real scenarios, leading to cross-domain in-context learning. Moreover, LLMs still face challenges with long-tail knowledge in unseen and unfamiliar domains. These limitations demonstrate the necessity of Unsupervised Domain Adaptation (UDA). In this paper, we study the UDA problem under an in-context learning setting to adapt language models from the source domain to the target domain without any target labels. The core idea is to retrieve a subset of cross-domain elements that are most similar to the query, and to elicit the language model to adapt in an in-context manner by learning both the target domain distribution and the discriminative task signal simultaneously from the augmented cross-domain in-context examples. We devise different prompting and training strategies, accounting for different LM architectures, to learn the target distribution via language modeling. With extensive experiments on Sentiment Analysis (SA) and Named Entity Recognition (NER) tasks, we thoroughly study the effectiveness of ICL for domain transfer and demonstrate significant improvements over baseline models.
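The retrieval step described above can be illustrated with a minimal sketch. It assumes bag-of-words cosine similarity as the retriever and a plain-text prompt layout; both are placeholders chosen to keep the example self-contained, not the retriever or prompt format used in the paper, and `retrieve` and `build_prompt` are hypothetical names. The idea shown is that unlabeled target-domain texts most similar to a labeled source-domain query are prepended as context, so the model sees target-domain text alongside the task signal.

```python
import math
from collections import Counter
from typing import List


def bow(text: str) -> Counter:
    # Lowercased whitespace tokens; a stand-in for a learned sentence encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, pool: List[str], k: int = 2) -> List[str]:
    """Return the k unlabeled target-domain texts most similar to the query."""
    q = bow(query)
    return sorted(pool, key=lambda doc: cosine(q, bow(doc)), reverse=True)[:k]


def build_prompt(query: str, label: str, target_pool: List[str], k: int = 2) -> str:
    """Prepend retrieved target-domain context to a labeled source-domain example."""
    context = "\n".join(retrieve(query, target_pool, k))
    return f"{context}\n{query}\nSentiment: {label}"


pool = ["the battery life is great", "the plot of the movie was dull", "battery drains fast"]
print(build_prompt("great battery life on this laptop", "positive", pool))
```

Swapping `bow` and `cosine` for a dense encoder would leave the surrounding logic unchanged, which is why the sketch keeps them as separate functions.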

Submitted 20 November, 2023; originally announced November 2023.

Comments: EMNLP 2023

arXiv:2311.11347 [pdf, other]

Large-scale Mixed Traffic Control Using Dynamic Vehicle Routing and Privacy-Preserving Crowdsourcing

Authors: Dawei Wang, Weizi Li, Jia Pan

Abstract: Controlling and coordinating urban traffic flow through robot vehicles (RVs) is emerging as a novel transportation paradigm for the future. While this approach garners growing attention from researchers and practitioners, effectively managing and coordinating large-scale mixed traffic remains a challenge. We introduce an effective framework for large-scale mixed traffic control via privacy-preserving crowdsourcing and dynamic vehicle routing. Our framework consists of three modules: a privacy-protecting crowdsensing method, a graph propagation-based traffic forecasting method, and a privacy-preserving route selection mechanism. We evaluate our framework using a real-world road network. The results show that our framework accurately forecasts traffic flow, efficiently mitigates the network-wide RV shortage issue, and coordinates large-scale mixed traffic. Compared to other baseline methods, our framework not only reduces RV shortage by up to 69.4% but also reduces the average waiting time of all vehicles in the network by up to 27%.

Submitted 19 November, 2023; originally announced November 2023.

Comments: Accepted to IEEE Internet of Things Journal

arXiv:2311.10776 [pdf, other]

Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis

Authors: Kexin Chen, Junyou Li, Kunyi Wang, Yuyang Du, Jiahui Yu, Jiamin Lu, Lanqing Li, Jiezhong Qiu, Jianzhang Pan, Yi Huang, Qun Fang, Pheng Ann Heng, Guangyong Chen

Abstract: Recent AI research points to a promising future of automated chemical reactions within the chemistry community. This study proposes Chemist-X, a transformative AI agent that automates the reaction condition recommendation (RCR) task in chemical synthesis with retrieval-augmented generation (RAG) technology. To emulate expert chemists' strategies when solving RCR tasks, Chemist-X utilizes advanced RAG schemes to interrogate online molecular databases and distill critical data from the latest literature database. Further, the agent leverages state-of-the-art computer-aided design (CAD) tools with a large language model (LLM) supervised programming interface. With the ability to utilize updated chemical knowledge and CAD tools, our agent significantly outperforms conventional synthesis AIs confined to the fixed knowledge within their training data. Chemist-X considerably reduces chemists' workload and allows them to focus on more fundamental and creative problems, thereby bringing computational techniques and chemical research closer together and making a remarkable leap toward harnessing AI's full capabilities in scientific discovery.

Submitted 4 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.10319 [pdf, other]

doi 10.1038/s41598-024-61822-9

Shifting to Machine Supervision: Annotation-Efficient Semi and Self-Supervised Learning for Automatic Medical Image Segmentation and Classification

Authors: Pranav Singh, Raviteja Chukkapalli, Shravan Chaudhari, Luoyao Chen, Mei Chen, Jinqian Pan, Craig Smuda, Jacopo Cirrone

Abstract: Advancements in clinical treatment are increasingly constrained by the limitations of supervised learning techniques, which depend heavily on large volumes of annotated data. The annotation process is not only costly but also demands substantial time from clinical specialists. To address this issue, we introduce the S4MI (Self-Supervision and Semi-Supervision for Medical Imaging) pipeline, a novel approach that leverages advancements in self-supervised and semi-supervised learning. These techniques engage in auxiliary tasks that do not require labeling, thus making machine supervision easier to scale than fully-supervised methods. Our study benchmarks these techniques on three distinct medical imaging datasets to evaluate their effectiveness in classification and segmentation tasks. Notably, we observed that self-supervised learning significantly surpassed the performance of supervised methods in classification on all evaluated datasets. Remarkably, the semi-supervised approach demonstrated superior outcomes in segmentation, outperforming fully-supervised methods while using 50% fewer labels across all datasets. In line with our commitment to contributing to the scientific community, we have made the S4MI code openly accessible, allowing for broader application and further development of these methods.

Submitted 17 May, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

Comments: Seventeen pages (incl. references), five figures, and one table. Accepted and published in Scientific Reports 14.1 (2024): 10820

Journal ref: Singh, P., Chukkapalli, R., Chaudhari, S. et al. Shifting to machine supervision: annotation-efficient semi and self-supervised learning for automatic medical image segmentation and classification. Sci Rep 14, 10820 (2024)

arXiv:2311.08347 [pdf, other]

High-efficiency single-photon source above the loss-tolerant threshold for efficient linear optical quantum computing

Authors: Xing Ding, Yong-Peng Guo, Mo-Chi Xu, Run-Ze Liu, Geng-Yan Zou, Jun-Yi Zhao, Zhen-Xuan Ge, Qi-Hang Zhang, Hua-Liang Liu, Lin-Jun Wang, Ming-Cheng Chen, Hui Wang, Yu-Ming He, Yong-Heng Huo, Chao-Yang Lu, Jian-Wei Pan

Abstract: Photon loss is the biggest obstacle to scalable photonic quantum information processing. This problem can be tackled by using quantum error correction, provided that the overall photon loss is below a threshold of 1/3. However, all reported on-demand and indistinguishable single-photon sources still fall short of this threshold. Here, by using tailor-shaped laser pulse excitation on a high-quantum-efficiency single quantum dot deterministically coupled to a tunable open microcavity, we demonstrate a high-performance source with a single-photon purity of 0.9795(6), photon indistinguishability of 0.9856(13), and an overall system efficiency of 0.712(18), simultaneously. This source reaches, for the first time, the efficiency threshold for scalable photonic quantum computing. With this source, we further demonstrate 1.89(14) dB intensity squeezing, and consecutive 40-photon events with a 1.67 mHz count rate.
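As a quick sanity check on the threshold claim, a loss below 1/3 is equivalent to an efficiency above 2/3 (about 0.667). The snippet below compares the reported overall system efficiency against that bound. It assumes that overall photon loss equals 1 minus the overall system efficiency and reads 0.712(18) as 0.712 ± 0.018; under those assumptions the source stays below the loss threshold even one uncertainty below its central value.

```python
LOSS_THRESHOLD = 1 / 3  # maximum tolerable overall photon loss, per the abstract


def overall_loss(efficiency: float) -> float:
    # Assumption: overall loss is taken as 1 - overall system efficiency.
    return 1.0 - efficiency


def meets_threshold(efficiency: float, uncertainty: float = 0.0) -> bool:
    """True if the loss stays below 1/3 even at (efficiency - uncertainty)."""
    return overall_loss(efficiency - uncertainty) < LOSS_THRESHOLD


# Reported overall system efficiency 0.712(18), read as 0.712 +/- 0.018.
print(round(overall_loss(0.712), 3))   # 0.288
print(meets_threshold(0.712))          # True
print(meets_threshold(0.712, 0.018))   # True (worst-case loss about 0.306)
```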

Submitted 28 November, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

arXiv:2311.07547 [pdf, other]

GPT-4V(ision) as A Social Media Analysis Engine

Authors: Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo

Abstract: Recent research has offered insights into the extraordinary capabilities of Large Multimodal Models (LMMs) in various general vision and language tasks. There is growing interest in how LMMs perform in more specialized domains. Social media content, inherently multimodal, blends text, images, videos, and sometimes audio. Understanding social multimedia content remains a challenging problem for contemporary machine learning frameworks. In this paper, we explore GPT-4V(ision)'s capabilities for social multimedia analysis. We select five representative tasks, including sentiment analysis, hate speech detection, fake news identification, demographic inference, and political ideology detection, to evaluate GPT-4V. Our investigation begins with a preliminary quantitative analysis for each task using existing benchmark datasets, followed by a careful review of the results and a selection of qualitative samples that illustrate GPT-4V's potential in understanding multimodal social media content. GPT-4V demonstrates remarkable efficacy in these tasks, showcasing strengths such as joint understanding of image-text pairs, contextual and cultural awareness, and extensive commonsense knowledge. Despite the overall impressive capacity of GPT-4V in the social media domain, there remain notable challenges. GPT-4V struggles with tasks involving multilingual social multimedia comprehension and has difficulties in generalizing to the latest trends in social media. Additionally, it exhibits a tendency to generate erroneous information in the context of evolving celebrity and politician knowledge, reflecting the known hallucination problem. The insights gleaned from our findings underscore a promising future for LMMs in enhancing our comprehension of social media content and its users through the analysis of multimodal information.

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.05261 [pdf]

RAGLog: Log Anomaly Detection using Retrieval Augmented Generation

Authors: Jonathan Pan, Swee Liang Wong, Yidi Yuan

Abstract: Detecting anomalies in system logs is a vital activity for ensuring the cyber resiliency of systems. It is applied to fault identification and to facilitate cyber investigation and digital forensics. However, because logs from different systems and components differ significantly, performing such analysis manually is challenging given the volume, variety, and velocity of logs. This is further complicated by the scarcity or unavailability of anomalous log entries for training machine learning or artificial intelligence models for such purposes. In this research work, we explore the use of a Retrieval Augmented Large Language Model that leverages a vector database to detect anomalies from logs. We used a Question and Answer configuration pipeline. To the best of our knowledge, our experiment, which we call RAGLog, is novel, and the experimental results show much promise.

Submitted 9 November, 2023; originally announced November 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2203.10960

arXiv:2311.02248 [pdf, other]

COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning

Authors: Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li

Abstract: We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under 30 million trainable parameters and only 450 hours of English speech data, COSMIC demonstrates emerging capabilities in instruction-following and in-context learning. Equipped with such capabilities, COSMIC achieves a maximum BLEU score of 33.18 in 0-shot EN-to-X speech-to-text translation (S2TT) and a significant boost in the 1-shot setting. Additionally, there is an average 25.8% relative Word Error Rate (WER) reduction for 1-shot cross-domain adaptation. COSMIC exhibits a significant automatic speech recognition (ASR) accuracy gain in contextual biasing tasks due to its instruction-following capability.
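The abstract reports a relative, not absolute, WER reduction. For readers comparing against absolute numbers, the conventional definition is sketched below. The WER values are hypothetical, and the abstract does not say how the per-domain reductions are averaged, so the unweighted mean used here is an assumption.

```python
from typing import Iterable, Tuple


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def average_relative_reduction(pairs: Iterable[Tuple[float, float]]) -> float:
    """Unweighted mean of per-domain relative reductions (an assumed averaging rule)."""
    values = [relative_wer_reduction(b, n) for b, n in pairs]
    return sum(values) / len(values)


# Hypothetical numbers: a baseline WER of 10.0% dropping to 7.42% is a 25.8% relative
# reduction, even though the absolute drop is only 2.58 percentage points.
print(round(relative_wer_reduction(10.0, 7.42), 1))  # 25.8
```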

Submitted 14 June, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

arXiv:2311.00047 [pdf, other]

Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Authors: Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai

Abstract: Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans, emulating our understanding of the world. However, human perception of reality is not always faithful to the physical world, a phenomenon known as visual illusions. This raises a key question: do VLMs experience the same kinds of illusions as humans do, or do they learn to represent reality faithfully? To investigate this question, we build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs. Our findings show that although the overall alignment is low, larger models are closer to human perception and more susceptible to visual illusions. Our dataset and initial findings will promote a better understanding of visual illusions in humans and machines and provide a stepping stone for future computational models that can better align humans and machines in perceiving and communicating about the shared visual world. The code and data are available at https://github.com/vl-illusion/dataset.

Submitted 31 October, 2023; originally announced November 2023.

Comments: Accepted at EMNLP 2023 main conference

arXiv:2310.19294 [pdf, other]

Dual-comb spectroscopy over 100km open-air path

Authors: Jin-Jian Han, Wei Zhong, Ruo-Can Zhao, Ting Zeng, Min Li, Jian Lu, Xin-Xin Peng, Xi-Ping Shi, Qin Yin, Yong Wang, Ali Esamdin, Qi Shen, Jian-Yu Guan, Lei Hou, Ji-Gang Ren, Jian-Jun Jia, Yu Wang, Hai-Feng Jiang, XiangHui Xue, Qiang Zhang, Xian-Kang Dou, Jian-Wei Pan

Abstract: Satellite-based greenhouse gas (GHG) sensing technologies play a critical role in the study of global carbon emissions and climate change. However, none of the existing satellite-based GHG sensing technologies can simultaneously achieve broad measurement bandwidth, high temporal-spatial resolution, and high sensitivity. Recently, dual-comb spectroscopy (DCS) has been proposed as a superior candidate technology for GHG sensing because it can measure broadband spectra with high temporal-spatial resolution and high sensitivity. The main barrier to deploying DCS on satellites is the short open-air measurement distance achieved thus far: prior research has not been able to implement DCS over more than 20 km of open-air path. Here, by developing a bistatic setup using time-frequency dissemination and high-power optical frequency combs, we have implemented DCS over a 113 km turbulent horizontal open-air path. Our experiment successfully measured GHG with 7 nm spectral bandwidth and a 10 kHz frequency, and achieved a CO2 sensing precision of <2 ppm in 5 minutes and <0.6 ppm in 36 minutes. Our results represent a significant step towards advancing the implementation of DCS as a satellite-based technology and improving technologies for GHG monitoring.

Submitted 31 October, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

Comments: 24 pages, 6 figures

arXiv:2310.19019 [pdf, other]

TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise

Authors: Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao, Zhaohui Hou, Zhiyuan Huang, Shaoqing Lu, Ding Liang, Mingjie Zhan

Abstract: Large Language Models (LLMs) exhibit impressive reasoning and data augmentation capabilities in various NLP tasks. However, what about small models? In this work, we propose TeacherLM-7.1B, capable of annotating relevant fundamentals, chain of thought, and common mistakes for most NLP samples, which makes annotation more than just an answer, thus allowing other models to learn "why" instead of just "what". The TeacherLM-7.1B model achieved a zero-shot score of 52.3 on MMLU, surpassing most models with over 100B parameters. Even more remarkable is its data augmentation ability. Based on TeacherLM-7.1B, we augmented 58 NLP datasets and taught various student models of different sizes from the OPT and BLOOM series in a multi-task setting. The experimental results indicate that the data augmentation provided by TeacherLM has brought significant benefits. We will release the TeacherLM series of models and augmented datasets as open-source.

Submitted 15 July, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

Comments: 5 figures, 15 pages

arXiv:2310.18292 [pdf, other]

Twin-field quantum key distribution with local frequency reference

Authors: Jiu-Peng Chen, Fei Zhou, Chi Zhang, Cong Jiang, Fa-Xi Chen, Jia Huang, Hao Li, Li-Xing You, Xiang-Bin Wang, Yang Liu, Qiang Zhang, Jian-Wei Pan

Abstract: Twin-field quantum key distribution (TF-QKD) overcomes the linear rate-loss limit, promising a boost in secure key rate over long distances. However, the complexity of eliminating the frequency differences between the independent laser sources hinders its practical application. Here, taking the saturated absorption spectroscopy of acetylene as an absolute reference, we propose and demonstrate a simple and practical approach to realize TF-QKD without requiring relative frequency control of the independent laser sources. Adopting the 4-intensity sending-or-not-sending TF-QKD protocol, we experimentally demonstrate TF-QKD over 502 km, 301 km, and 201 km of ultra-low-loss optical fiber. We expect this high-performance scheme to find widespread use in future intercity and free-space quantum communication networks.

Submitted 27 October, 2023; originally announced October 2023.

Comments: 13 pages, 5 figures, 7 tables

Journal ref: Phys. Rev. Lett. 132, 260802 (2024)

arXiv:2310.17924 [pdf, other]

SOUL: Towards Sentiment and Opinion Understanding of Language

Authors: Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing

Abstract: Sentiment analysis is a well-established natural language processing task, with sentiment polarity classification being one of its most popular and representative tasks. However, despite the success of pre-trained language models in this area, they often fall short of capturing the broader complexities of sentiment analysis. To address this issue, we propose a new task called Sentiment and Opinion Understanding of Language (SOUL). SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG). RC seeks to validate statements that focus on subjective information based on a review text, while JG requires models to provide explanations for their sentiment predictions. To enable comprehensive evaluation, we annotate a new dataset comprising 15,028 statements from 3,638 reviews. Experimental results indicate that SOUL is a challenging task for both small and large language models, with a performance gap of up to 27% when compared to human performance. Furthermore, evaluations conducted with both human experts and GPT-4 highlight the limitations of the small language model in generating reasoning-based justifications. These findings underscore the challenging nature of the SOUL task for existing models, emphasizing the need for further advancements in sentiment analysis to address its complexities. The new dataset and code are available at https://github.com/DAMO-NLP-SG/SOUL.

Submitted 27 October, 2023; originally announced October 2023.

Comments: EMNLP 2023 Main Conference, Short Paper

arXiv:2310.17113 [pdf, other]

doi 10.1109/JSTQE.2023.3328870

Compact free-running InGaAs/InP single-photon detector with 40% detection efficiency and 2.3 kcps dark count rate

Authors: Qi Xu, Chao Yu, Wei Chen, Jianglin Zhao, Dajian Cui, Jun Zhang, Jian-Wei Pan

Abstract: Free-running InGaAs/InP single-photon detectors (SPDs) based on negative-feedback avalanche diodes (NFADs) are the key components for applications requiring asynchronous single-photon detection in the near-infrared region. For practical applications, SPDs with high photon detection efficiency (PDE), low noise, a large sensitive area, and compactness are highly desired for system integration and performance enhancement. Here, we present the implementation of a compact four-channel multimode-fiber-coupled free-running InGaAs/InP SPD with the best overall performance to date. On the one hand, we design and fabricate structure-optimized InGaAs/InP NFAD devices with a 25 μm diameter active area and integrated thin-film resistors to enhance the maximum achievable PDE. On the other hand, we apply a compact thermoacoustic cryocooler to regulate the operating temperature of the NFADs within a large range, and design a dedicated readout circuit with minimized parasitic parameters and tunable hold-off time settings to suppress the afterpulsing effect. The SPD is then characterized to simultaneously achieve remarkable overall performance at 1550 nm, i.e., 40% PDE, 2.3 kcps dark count rate, 8% afterpulse probability, and 49 ps timing jitter (full width at half maximum) under the conditions of 5.9 V excess bias voltage, 10 μs hold-off time, and 213 K operating temperature. Such performance and the results of the long-term stability tests indicate that the SPD could be a favorable solution for practical applications.

Submitted 25 October, 2023; originally announced October 2023.

Comments: 7 pages, 7 figures. Accepted for publication in the IEEE Journal of Selected Topics in Quantum Electronics

Journal ref: IEEE Journal of Selected Topics in Quantum Electronics 30, 6400107 (2024)
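The trade-off behind a tunable hold-off time can be made concrete with a small calculation. The sketch below assumes the textbook non-paralyzable dead-time model, in which every detection blinds the detector for the hold-off time τ; the abstract does not say which model the authors use, and the function names are illustrative.

```python
def max_count_rate(holdoff_s: float) -> float:
    """Saturation count rate (counts/s) when each detection is followed by a hold-off of holdoff_s seconds."""
    if holdoff_s <= 0:
        raise ValueError("hold-off time must be positive")
    return 1.0 / holdoff_s


def true_rate_nonparalyzable(measured_cps: float, holdoff_s: float) -> float:
    """Invert the non-paralyzable model measured = true / (1 + true * tau) to recover the true event rate."""
    dead_fraction = measured_cps * holdoff_s  # fraction of each second the detector is blind
    if dead_fraction >= 1.0:
        raise ValueError("measured rate is at or above the saturation limit 1/tau")
    return measured_cps / (1.0 - dead_fraction)


if __name__ == "__main__":
    tau = 10e-6  # 10 us hold-off, the value quoted in the abstract
    print(f"saturation rate: {max_count_rate(tau):.0f} cps")
    print(f"2.3 kcps measured -> {true_rate_nonparalyzable(2.3e3, tau):.0f} cps corrected")
```

Under this model a 10 μs hold-off caps the count rate at 100 kcps, so lengthening the hold-off to suppress afterpulsing also lowers the usable count-rate ceiling.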

arXiv:2310.16713 [pdf, other]

SkyMath: Technical Report

Authors: Liu Yang, Haihua Yang, Wenjun Cheng, Lei Lin, Chenxia Li, Yifu Chen, Lunan Liu, Jianfei Pan, Tianwen Wei, Biye Li, Liang Zhao, Lijie Wang, Bo Zhu, Guoliang Li, Xuejie Wu, Xilin Luo, Rui Hu

Abstract: Large language models (LLMs) have shown great potential for solving a variety of natural language processing (NLP) tasks, including mathematical reasoning. In this work, we present SkyMath, a large language model for mathematics with 13 billion parameters. By applying self-compare fine-tuning, we have remarkably enhanced the mathematical reasoning abilities of Skywork-13B-Base. On GSM8K, SkyMath outperforms all known open-source models of similar size and establishes a new state-of-the-art (SOTA) performance.

Submitted 26 October, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

arXiv:2310.15590 [pdf, other]

Facial Data Minimization: Shallow Model as Your Privacy Filter

Authors: Yuwen Pu, Jiahao Chen, Jiayu Pan, Hao Li, Diqun Yan, Xuhong Zhang, Shouling Ji

Abstract: Face recognition services are used in many fields and bring much convenience to people. However, once a user's facial data is transmitted to a service provider, the user loses control of his/her private data. In recent years, various security and privacy issues have arisen from the leakage of facial data. Although many privacy-preserving methods have been proposed, they usually fail when they have no access to adversaries' strategies or auxiliary data. Hence, in this paper, by fully considering two cases that are very typical in face recognition service systems, uploading facial images and uploading facial features, we propose a data privacy minimization transformation (PMT) method. This method processes the original facial data based on the shallow model of authorized services to obtain obfuscated data. The obfuscated data not only maintains satisfactory performance on authorized models while restricting the performance of unauthorized models, but also prevents the original private data from being leaked through AI methods or human visual theft. Additionally, since a service provider may apply preprocessing operations to the received data, we also propose an enhanced perturbation method to improve the robustness of PMT. Besides, to authorize one facial image for multiple service models simultaneously, a multiple restriction mechanism is proposed to improve the scalability of PMT.
Finally, we conduct extensive experiments and evaluate the effectiveness of the proposed PMT in defending against face reconstruction, data abuse, and face attribute estimation attacks. These experimental results demonstrate that PMT performs well in preventing facial data abuse and privacy leakage while maintaining face recognition accuracy.

Submitted 12 November, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: 14 pages, 11 figures

arXiv:2310.15539 [pdf, other]

SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation

Authors: Jialing Pan, Adrien Sadé, Jin Kim, Eric Soriano, Guillem Sole, Sylvain Flamant

Abstract: With the recent focus on Large Language Models (LLMs), both StarCoder (Li et al., 2023) and Code Llama (Rozière et al., 2023) have demonstrated remarkable performance in code generation. However, code translation still needs improvement, particularly with efficient training techniques. In response, we introduce SteloCoder, a decoder-only StarCoder-based LLM designed specifically for multi-programming-language-to-Python code translation. In particular, SteloCoder translates C++, C#, JavaScript, Java, or PHP to Python without the input programming language being specified. We modify the StarCoder model architecture by incorporating a Mixture-of-Experts (MoE) technique featuring five experts and a gating network for multi-task handling. The experts are obtained by fine-tuning StarCoder. Specifically, we use the Low-Rank Adaptation (LoRA) technique, limiting the size of each expert to only 0.06% of StarCoder's parameter count. At the same time, to make training more time-efficient, we adopt a curriculum learning strategy and use self-instruct data for fine-tuning. As a result, each expert takes only 6 hours to train on a single 80GB A100 HBM. In experiments on the XLCoST datasets, SteloCoder achieves an average CodeBLEU score of 73.76 in multi-programming-language-to-Python translation, surpassing the top performance on the leaderboard by at least 3.5. This is achieved with only 45M extra parameters on top of the StarCoder backbone and 32 hours of valid training on one 80GB A100 HBM.
The source code is released here: https://github.com/sade-adrien/SteloCoder.

Submitted 15 December, 2023; v1 submitted 24 October, 2023; originally announced October 2023.
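Two ingredients named in the SteloCoder abstract, LoRA experts and a gating network, can be illustrated with a pure-Python toy. This is a sketch under assumptions, not SteloCoder's implementation: the matrix sizes, rank, scaling, and the top-1 softmax router are illustrative choices. LoRA keeps a weight W frozen and learns a low-rank update B·A, so a d_out × d_in matrix gains only r·(d_in + d_out) trainable parameters.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(m: Matrix, x: Sequence[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Sequence[float], alpha: float) -> List[float]:
    """y = W x + (alpha / r) * B (A x); W (d_out x d_in) is frozen, only A (r x d_in) and B (d_out x r) train."""
    r = len(a)
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + (alpha / r) * d for y, d in zip(base, delta)]


def lora_param_fraction(d_in: int, d_out: int, r: int) -> float:
    """Trainable LoRA parameters as a fraction of the full d_out x d_in weight matrix."""
    return r * (d_in + d_out) / (d_in * d_out)


def softmax(logits: Sequence[float]) -> List[float]:
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(gate_logits: Sequence[float]) -> int:
    """Hypothetical top-1 router: index of the expert with the largest gate probability."""
    probs = softmax(gate_logits)
    return max(range(len(probs)), key=lambda i: probs[i])


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
    A = [[1.0, 1.0]]              # rank r = 1
    B = [[0.5], [0.0]]
    print(lora_forward(W, A, B, [2.0, 3.0], alpha=1.0))
    # Five gate logits, e.g. one per source language (C++, C#, JavaScript, Java, PHP).
    print(route_top1([0.1, 2.0, -1.0, 0.5, 0.3]))
```

For a hypothetical 4096 × 4096 projection with rank 8, `lora_param_fraction` gives 1/256 of that matrix; the 0.06% in the abstract is relative to all of StarCoder's parameters and depends on which matrices are adapted.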

arXiv:2310.14050 [pdf, other]

Code-Switching with Word Senses for Pretraining in Neural Machine Translation

Authors: Vivek Iyer, Edoardo Barba, Alexandra Birch, Jeff Z. Pan, Roberto Navigli

Abstract: Lexical ambiguity is a significant and pervasive challenge in Neural Machine Translation (NMT), with many state-of-the-art (SOTA) NMT systems struggling to handle polysemous words (Campolungo et al., 2022). The same holds for the NMT pretraining paradigm of denoising synthetic "code-switched" text (Pan et al., 2021; Iyer et al., 2023), where word senses are ignored in the noising stage, leading to harmful sense biases in the pretraining data that are subsequently inherited by the resulting models. In this work, we introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT), an end-to-end approach for pretraining multilingual NMT models that leverages word-sense-specific information from Knowledge Bases. Our experiments show significant improvements in overall translation quality. We then show that our approach scales robustly to various challenging data and resource-scarce scenarios and, finally, report fine-grained accuracy improvements on the DiBiMT disambiguation benchmark. Our studies yield interesting and novel insights into the merits and challenges of integrating word sense information and structured knowledge into multilingual pretraining for NMT.

Submitted 21 October, 2023; originally announced October 2023.

Comments: EMNLP (Findings) 2023 Long Paper
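The difference between surface-form and sense-aware code-switching in the noising stage can be shown with a toy function. This is a minimal sketch under assumptions: the sense labels, the lexicon keyed by (word, sense) pairs, and the fixed replacement ratio are hypothetical stand-ins, whereas WSP-NMT draws its sense information from knowledge bases.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical lexicon keyed by (source word, sense label) -> target-language word.
SenseLexicon = Dict[Tuple[str, str], str]


def sense_aware_code_switch(tokens: Sequence[str], senses: Sequence[Optional[str]],
                            lexicon: SenseLexicon, ratio: float, seed: int = 0) -> List[str]:
    """Replace a fraction (ratio in [0, 1]) of sense-tagged tokens with the translation of their sense."""
    rng = random.Random(seed)
    candidates = [i for i, (tok, sense) in enumerate(zip(tokens, senses))
                  if sense is not None and (tok, sense) in lexicon]
    chosen = set(rng.sample(candidates, round(ratio * len(candidates))))
    return [lexicon[(tok, senses[i])] if i in chosen else tok for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    lexicon = {("bank", "finance"): "Bank", ("bank", "river"): "Ufer"}
    print(sense_aware_code_switch(["she", "sat", "on", "the", "bank"],
                                  [None, None, None, None, "river"], lexicon, ratio=1.0))
```

A lexicon keyed only by the surface form "bank" would have to pick one German translation for both readings; keying by sense lets the river reading become "Ufer" and the financial reading "Bank", which is the kind of sense bias the abstract describes avoiding.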

arXiv:2310.14024 [pdf, other]

Observation and quantification of pseudogap in unitary Fermi gases

Authors: Xi Li, Shuai Wang, Xiang Luo, Yu-Yang Zhou, Ke Xie, Hong-Chi Shen, Yu-Zhao Nie, Qijin Chen, Hui Hu, Yu-Ao Chen, Xing-Can Yao, Jian-Wei Pan

Abstract: The nature of the pseudogap lies at the heart of strongly interacting superconductivity and superfluidity. With known pairing interactions, unitary Fermi gases provide an ideal testbed for verifying whether a pseudogap can arise from many-body pairing. Here we report the observation of the long-sought pair-fluctuation-driven pseudogap in homogeneous unitary Fermi gases of lithium-6 atoms, obtained by precisely measuring the spectral function through momentum-resolved microwave spectroscopy free from serious final-state effects. We find a large pseudogap above the superfluid transition. The inverse pair lifetime exhibits thermally activated exponential behavior, uncovering the microscopic mechanism of virtual pair breaking and recombination. The obtained large, T-independent single-particle scattering rate is comparable to that set by the Planckian limit. Our findings quantitatively characterize the pseudogap in strongly interacting Fermi gases, highlighting the role of preformed pairing as a precursor to superfluidity.

Submitted 21 October, 2023; originally announced October 2023.

arXiv:2310.14021 [pdf, other]

Survey of Vector Database Management Systems

Authors: James Jie Pan, Jianguo Wang, Guoliang Li

Abstract: There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search for a staggering half century and more. Driving this shift from algorithms to systems are new data-intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs; however, there is no comprehensive survey that thoroughly reviews these techniques and systems. We start by identifying five main obstacles to vector data management, namely the vagueness of semantic similarity, the large size of vectors, the high cost of similarity comparison, the lack of natural partitioning that can be used for indexing, and the difficulty of efficiently answering hybrid queries that require both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning based on randomization, learned partitioning, and navigable partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, and hardware-accelerated execution.
These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including native systems specialized for vectors and extended systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally we outline research challenges and point the direction for future work.

Submitted 21 October, 2023; originally announced October 2023.

Comments: 25 pages
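Two of the techniques surveyed above, partition-based indexing and hybrid queries, can be illustrated together with a toy inverted-file (IVF) index. This is a minimal pure-Python sketch under assumptions: the centroids are supplied by hand rather than learned (real systems typically train them, e.g. with k-means), distance is squared Euclidean, and the attribute predicate is applied as a filter during the scan of the probed partitions.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyIVFIndex:
    """Inverted-file index: each vector is stored in the partition of its nearest centroid."""

    def __init__(self, centroids: List[Vector]):
        self.centroids = centroids
        self.lists: Dict[int, List[Tuple[int, Vector, dict]]] = {i: [] for i in range(len(centroids))}

    def _nearest_partitions(self, v: Vector) -> List[int]:
        return sorted(range(len(self.centroids)), key=lambda i: sq_dist(v, self.centroids[i]))

    def add(self, item_id: int, v: Vector, attrs: dict) -> None:
        self.lists[self._nearest_partitions(v)[0]].append((item_id, v, attrs))

    def search(self, q: Vector, k: int, n_probe: int = 1,
               predicate: Callable[[dict], bool] = lambda attrs: True) -> List[Tuple[float, int]]:
        """Approximate k-NN: scan only the n_probe closest partitions, keeping items that pass predicate."""
        hits = [
            (sq_dist(q, v), item_id)
            for p in self._nearest_partitions(q)[:n_probe]
            for item_id, v, attrs in self.lists[p]
            if predicate(attrs)  # hybrid query: attribute filter applied during the vector scan
        ]
        return heapq.nsmallest(k, hits)
```

With `n_probe=1` the search is approximate: a true nearest neighbor stored in a different partition is missed, and raising `n_probe` trades speed for recall. A strict predicate can likewise leave fewer than `k` results, one face of the hybrid-query difficulty the survey lists.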

arXiv:2310.12008 [pdf, other]

Multi-view Contrastive Learning for Entity Typing over Knowledge Graphs

Authors: Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, Jeff Z. Pan

Abstract: Knowledge graph entity typing (KGET) aims to infer plausible types of entities in knowledge graphs. Existing approaches to KGET focus on how to better encode the knowledge provided by the neighbors and types of an entity into its representation. However, they ignore the semantic knowledge provided by the way in which types can be clustered together. In this paper, we propose a novel method called Multi-view Contrastive Learning for knowledge graph Entity Typing (MCLET), which effectively encodes the coarse-grained knowledge provided by clusters into entity and type embeddings. MCLET is composed of three modules: i) a Multi-view Generation and Encoder module, which encodes structured information from entity-type, entity-cluster, and cluster-type views; ii) a Cross-view Contrastive Learning module, which encourages different views to collaboratively improve view-specific representations of entities and types; and iii) an Entity Typing Prediction module, which integrates multi-head attention and a Mixture-of-Experts strategy to infer missing entity types. Extensive experiments show the strong performance of MCLET compared to the state of the art.

Submitted 18 October, 2023; originally announced October 2023.

Comments: Accepted at EMNLP 2023 Main
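A common way to implement a cross-view contrastive objective such as the one in MCLET's second module is an InfoNCE loss: the two views of the same entity form the positive pair, and the other entities in the batch act as negatives. The sketch below assumes that generic formulation (cosine similarity, temperature `tau`, one direction only); MCLET's exact loss is not given in the abstract.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def cross_view_infonce(view_a: List[Sequence[float]], view_b: List[Sequence[float]],
                       tau: float = 0.5) -> float:
    """Mean InfoNCE loss where view_a[i] and view_b[i] embed the same entity from two views."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / tau for j in range(n)]
        top = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits))
        total += log_denom - logits[i]  # -log softmax probability of the matching view
    return total / n
```

When the two views of each entity agree, the loss approaches its minimum; swapping views across entities makes it large, which is the pressure that pulls view-specific representations together. A symmetric variant would also average the loss computed from `view_b` to `view_a`.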

arXiv:2310.11676 [pdf, other]

PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly Detection

Authors: Junjun Pan, Yixin Liu, Yizhen Zheng, Shirui Pan

Abstract: Node-level graph anomaly detection (GAD) plays a critical role in identifying anomalous nodes in graph-structured data from domains such as medicine, social networks, and e-commerce. However, the diversity of anomalies and the dearth of labeled data make this task challenging. Existing methods, namely reconstruction-based and contrastive learning approaches, are effective but often suffer from efficiency issues stemming from their complex objectives and elaborate modules. To improve the efficiency of GAD, we introduce a simple method termed PREprocessing and Matching (PREM for short). Our approach streamlines GAD, reducing time and memory consumption while maintaining powerful anomaly detection capabilities. Comprising two modules, a pre-processing module and an ego-neighbor matching module, PREM eliminates the need for message-passing propagation during training and employs a simple contrastive loss, leading to considerable reductions in training time and memory usage. Moreover, rigorous evaluations on five real-world datasets demonstrate the robustness and effectiveness of our method. Notably, when validated on the ACM dataset, PREM achieved a 5% improvement in AUC, a 9-fold increase in training speed, and sharply reduced memory usage compared to the most efficient baseline.

Submitted 27 November, 2023; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: Accepted by IEEE International Conference on Data Mining 2023 (ICDM 2023)
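PREM's efficiency argument rests on moving neighbor aggregation out of the training loop. A minimal sketch of that idea, under assumptions: neighbor features are averaged once in a preprocessing pass, and a node's anomaly score is the cosine dissimilarity between its own features and that precomputed neighbor summary. PREM itself trains a model with a contrastive loss on such ego-neighbor pairs; the untrained scoring rule below is a deliberate simplification.

```python
import math
from typing import Dict, List, Sequence


def preprocess_neighbor_means(features: List[Sequence[float]], adj: Dict[int, List[int]]) -> List[List[float]]:
    """One-off aggregation: mean of each node's neighbor features (no message passing at training time)."""
    dim = len(features[0])
    means = []
    for node in range(len(features)):
        nbrs = adj.get(node, [])
        if not nbrs:
            means.append([0.0] * dim)
            continue
        means.append([sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)])
    return means


def ego_neighbor_scores(features: List[Sequence[float]], neighbor_means: List[List[float]]) -> List[float]:
    """Score = 1 - cosine(ego, neighbor mean); higher means the node disagrees with its neighborhood."""
    scores = []
    for ego, ctx in zip(features, neighbor_means):
        ne = math.sqrt(sum(x * x for x in ego))
        nc = math.sqrt(sum(x * x for x in ctx))
        if ne == 0 or nc == 0:
            scores.append(1.0)  # no usable context: treat as maximally inconsistent
            continue
        scores.append(1.0 - sum(a * b for a, b in zip(ego, ctx)) / (ne * nc))
    return scores
```

Because the neighbor means are computed once, each training or scoring pass touches only fixed-size vectors per node, which is where the time and memory savings over message-passing encoders come from.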

arXiv:2310.11037 [pdf, ps, other]

Sampling for Remote Estimation of the Wiener Process over an Unreliable Channel

Authors: Jiayu Pan, Yin Sun, Ness B. Shroff

Abstract: In this paper, we study a sampling problem where a source takes samples from a Wiener process and transmits them through a wireless channel to a remote estimator. Due to channel fading, interference, and potential collisions, packet transmissions are unreliable and can take random time durations. Our objective is to devise an optimal causal sampling policy that minimizes the long-term average mean square estimation error. This optimal sampling problem is a recursive optimal stopping problem, which is generally quite difficult to solve. However, we prove that the optimal sampling strategy is, in fact, a simple threshold policy where a new sample is taken whenever the instantaneous estimation error exceeds a threshold. This threshold is a constant that does not vary over time. By exploring the structural properties of the recursive optimal stopping problem, we develop a low-complexity iterative algorithm to compute the optimal threshold. This work generalizes previous research by incorporating both transmission errors and random transmission times into remote estimation. Numerical simulations are provided to compare our optimal policy with the zero-wait and age-optimal policies.

Submitted 18 October, 2023; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: Accepted by ACM Sigmetrics, will appear in ACM POMACS journal
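The threshold structure described above can be explored with a small discrete-time simulation. This is an illustration under assumptions rather than the paper's model: time is discretized with step `dt`, each transmission occupies the channel for a uniformly random number of steps, a packet is lost with probability `p_loss`, the sampler knows the current estimate (e.g. via acknowledgments), the estimator holds the last successfully delivered sample, and the threshold is a free parameter rather than the optimum computed by the paper's algorithm.

```python
import random
from typing import Optional, Tuple


def simulate_threshold_policy(threshold: float, p_loss: float, max_delay_steps: int,
                              steps: int = 100_000, dt: float = 0.01, seed: int = 1) -> Tuple[float, int]:
    """Return (time-average squared estimation error, number of samples taken)."""
    rng = random.Random(seed)
    w = 0.0          # current value of the Wiener process at the source
    estimate = 0.0   # remote estimate: value of the last successfully delivered sample
    in_flight: Optional[Tuple[float, int]] = None  # (sampled value, steps until transmission ends)
    sq_err_sum, samples = 0.0, 0
    for _ in range(steps):
        w += rng.gauss(0.0, dt ** 0.5)  # independent Gaussian increment with variance dt
        if in_flight is not None:
            value, remaining = in_flight
            if remaining <= 1:
                if rng.random() >= p_loss:  # delivered; otherwise the packet is lost
                    estimate = value
                in_flight = None
            else:
                in_flight = (value, remaining - 1)
        # Threshold rule: sample only when the channel is idle and the error has reached the threshold.
        if in_flight is None and abs(w - estimate) >= threshold:
            in_flight = (w, rng.randint(1, max_delay_steps))
            samples += 1
        sq_err_sum += (w - estimate) ** 2
    return sq_err_sum / steps, samples
```

Setting `threshold=0.0` recovers a zero-wait policy (sample whenever the channel is free); sweeping the threshold for a given `p_loss` and delay distribution gives an empirical view of the error-versus-sampling trade-off that the paper's iterative algorithm optimizes exactly.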

arXiv:2310.10365 [pdf, other]

doi 10.1103/PhysRevLett.131.133601

Berry Curvature and Bulk-Boundary Correspondence from Transport Measurement for Photonic Chern Bands

Authors: Chao Chen, Run-Ze Liu, Jizhou Wu, Zu-En Su, Xing Ding, Jian Qin, Lin Wang, Wei-Wei Zhang, Yu He, Xi-Lin Wang, Chao-Yang Lu, Li Li, Barry C. Sanders, Xiong-Jun Liu, Jian-Wei Pan

Abstract: Berry curvature is a fundamental element in characterizing topological quantum physics, yet a full measurement of Berry curvature in momentum space has not been reported for topological states. Here we achieve two-dimensional Berry curvature reconstruction in a photonic quantum anomalous Hall system via Hall transport measurement of a momentum-resolved wave packet. By integrating the measured Berry curvature over the two-dimensional Brillouin zone, we obtain Chern numbers of -1 and 0. Further, we identify the bulk-boundary correspondence by measuring topology-linked chiral edge states at the boundary. The full topological characterization of photonic Chern bands from Berry curvature, Chern number, and edge transport measurements enables our photonic system to serve as a versatile platform for further in-depth study of novel topological physics.

Submitted 16 October, 2023; originally announced October 2023.

Journal ref: Phys. Rev. Lett. 131, 133601 (25 September 2023)
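Numerically, integrating Berry curvature over the Brillouin zone is often done on a discretized k-grid with the Fukui-Hatsugai-Suzuki link-variable method, which returns an integer Chern number even on a coarse grid. The sketch below applies it to the two-band Qi-Wu-Zhang lattice model, an assumed textbook stand-in chosen only because it has phases with Chern number ±1 and 0; it does not model the authors' photonic system.

```python
import cmath
import math
from typing import Tuple

State = Tuple[complex, complex]


def qwz_lower_band(kx: float, ky: float, m: float) -> State:
    """Normalized lower-band eigenvector of H(k) = d(k) . sigma with d = (sin kx, sin ky, m + cos kx + cos ky)."""
    dx, dy = math.sin(kx), math.sin(ky)
    dz = m + math.cos(kx) + math.cos(ky)
    d = math.sqrt(dx * dx + dy * dy + dz * dz)
    # Two gauge choices for the same eigenvector; use the one that is non-singular at this k.
    if dz >= 0:
        a, b = complex(dx, -dy), complex(-(dz + d))
    else:
        a, b = complex(d - dz), complex(-dx, -dy)
    norm = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
    return a / norm, b / norm


def overlap(u: State, v: State) -> complex:
    return u[0].conjugate() * v[0] + u[1].conjugate() * v[1]


def chern_number(m: float, n: int = 24) -> int:
    """Lattice Chern number of the lower band on an n x n grid (requires a gapped band, |m| not in {0, 2})."""
    ks = [2 * math.pi * i / n for i in range(n)]
    u = [[qwz_lower_band(kx, ky, m) for ky in ks] for kx in ks]
    flux = 0.0
    for i in range(n):
        for j in range(n):
            ip, jp = (i + 1) % n, (j + 1) % n
            # Gauge-invariant product of overlaps around one plaquette; its phase is the Berry flux through it.
            loop = (overlap(u[i][j], u[ip][j]) * overlap(u[ip][j], u[ip][jp])
                    * overlap(u[ip][jp], u[i][jp]) * overlap(u[i][jp], u[i][j]))
            flux += cmath.phase(loop)
    return round(flux / (2 * math.pi))
```

Because each eigenvector enters every plaquette loop once as a bra and once as a ket, its arbitrary phase cancels, which is why switching gauges from point to point is harmless. For 0 < |m| < 2 the result is ±1 (opposite signs for m and -m), and for |m| > 2 it is 0.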

arXiv:2310.09824 [pdf, other]

Overconstrained Robotic Limb with Energy-Efficient, Omni-directional Locomotion

Authors: Ronghan Xu, Jiayi Yin, Shihao Feng, Bangchao Huang, Haoran Sun, Jia Pan, Fang Wan, Chaoyang Song

Abstract: This paper studies the design, modeling, and control of a novel quadruped featuring overconstrained robotic limbs that employ the Bennett linkage for motion and power transmission. The modular limb design allows the robot to morph into reptile- or mammal-inspired forms. In contrast to the prevailing focus on planar limbs, this research delves into classical overconstrained linkages, which have strong theoretical foundations in advanced kinematics but limited engineering applications. The study showcases the morphological superiority of overconstrained robotic limbs, exemplified by the Bennett linkage, which can transform into planar or spherical limbs. Through kinematic and dynamic modeling, we apply model predictive control to simulate a range of locomotion tasks, revealing that overconstrained limbs outperform planar designs in omni-directional tasks such as forward trotting, lateral trotting, and turning on the spot when foothold distances are considered. These findings highlight the biological distinctions in limb design between reptiles and mammals and represent the first documented instance of overconstrained robotic limbs outperforming planar designs in dynamic locomotion.

Submitted 3 February, 2024; v1 submitted 15 October, 2023; originally announced October 2023.

Comments: 19 pages, 13 figures, 2 tables
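The Bennett linkage is a spatial loop of four revolute (4R) joints. The Grübler count for such a loop gives a mobility of 6·3 - 5·4 = -2, yet a Bennett loop moves with one degree of freedom, which is why it is called overconstrained. That mobility exists only under the classical Bennett condition: zero joint offsets, equal opposite link lengths, equal opposite twist angles, and sin α₁/a₁ = sin α₂/a₂. The checker below encodes one common form of that condition; the function name, argument layout, and tolerance are illustrative choices, and twist sign conventions vary between texts.

```python
import math
from typing import Sequence


def is_bennett_linkage(lengths: Sequence[float], twists: Sequence[float],
                       offsets: Sequence[float] = (0.0, 0.0, 0.0, 0.0), tol: float = 1e-9) -> bool:
    """Check the Bennett condition for a 4R loop: link lengths a_i, twist angles alpha_i (radians), offsets d_i."""
    a1, a2, a3, a4 = lengths
    t1, t2, t3, t4 = twists
    if any(abs(d) > tol for d in offsets):
        return False  # Bennett loops have zero joint offsets
    if abs(a1 - a3) > tol or abs(a2 - a4) > tol:
        return False  # opposite links must have equal length
    if abs(t1 - t3) > tol or abs(t2 - t4) > tol:
        return False  # opposite twist angles must be equal
    # Ratio condition sin(alpha1)/a1 == sin(alpha2)/a2, cross-multiplied to avoid division by zero.
    return abs(math.sin(t1) * a2 - math.sin(t2) * a1) <= tol
```

For example, links of lengths 2 and 1 with twists of 90° and 30° satisfy the condition, since sin 90°/2 = sin 30°/1 = 0.5; changing either length breaks it.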

arXiv:2310.08276 [pdf, other]

Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval

Authors: Qing Ma, Jiancheng Pan, Cong Bai

Abstract: Image-text retrieval has developed rapidly in recent years. However, it remains challenging in remote sensing due to visual-semantic imbalance, which leads to incorrect matching of non-semantic visual and textual features. To solve this problem, we propose a novel Direction-Oriented Visual-semantic Embedding Model (DOVE) to mine the relationship between vision and language. Our key idea is to steer the visual and textual representations in the latent space as close as possible to a redundancy-free regional visual representation. Concretely, a Regional-Oriented Attention Module (ROAM) adaptively adjusts the distance between the final visual and textual embeddings in the latent semantic space, oriented by regional visual features. Meanwhile, a lightweight Digging Text Genome Assistant (DTGA) is designed to expand the range of tractable textual representations and enhance global word-level semantic connections using fewer attention operations. Finally, we exploit a global visual-semantic constraint to reduce single visual dependency and serve as an external constraint on the final visual and textual representations. The effectiveness and superiority of our method are verified by extensive experiments, including parameter evaluation, quantitative comparison, ablation studies, and visual analysis, on two benchmark datasets, RSICD and RSITMD.

Submitted 23 January, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

Comments: 14 pages, 11 figures

arXiv:2310.06654 [pdf, other]

Evaluating Explanation Methods for Vision-and-Language Navigation

Authors: Guanqi Chen, Lei Yang, Guanhua Chen, Jia Pan

Abstract: The ability to navigate robots with natural language instructions in an unknown environment is a crucial step toward embodied artificial intelligence (AI). As the performance of deep neural models proposed in the field of vision-and-language navigation (VLN) improves, it is equally interesting to know what information the models use for decision-making in navigation tasks. To understand the inner workings of deep neural models, various explanation methods have been developed to promote explainable AI (XAI). However, they have mostly been applied to deep neural models for image or text classification, and little work has been done on explaining deep neural models for VLN tasks. In this paper, we address these problems by building quantitative benchmarks that evaluate explanation methods for VLN models in terms of faithfulness. We propose a new erasure-based evaluation pipeline to measure step-wise textual explanations in the sequential decision-making setting. We evaluate several explanation methods for two representative VLN models on two popular VLN datasets and report valuable findings from our experiments.

Submitted 10 October, 2023; originally announced October 2023.

Comments: Accepted by ECAI 2023
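An erasure-based faithfulness check typically removes the tokens an explanation ranks as most important and measures how much the model's confidence in its original decision drops; a faithful explanation should cause a larger drop than one that highlights irrelevant tokens. The sketch below assumes that generic formulation for a single decision step and uses a hypothetical toy scorer in place of a VLN model; the paper's step-wise pipeline is not reproduced here.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer: returns the model's probability for its original action given the instruction.
Scorer = Callable[[Sequence[str]], float]


def erasure_drop(score: Scorer, tokens: Sequence[str], importance: Sequence[float], k: int,
                 mask: str = "[MASK]") -> float:
    """Confidence drop after masking the k tokens with the highest importance scores."""
    top_k = set(sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)[:k])
    erased: List[str] = [mask if i in top_k else tok for i, tok in enumerate(tokens)]
    return score(tokens) - score(erased)


if __name__ == "__main__":
    def toy_score(toks: Sequence[str]) -> float:
        return 0.9 if "left" in toks else 0.3  # pretend the model turns left only when it sees "left"

    instr = ["walk", "past", "the", "sofa", "and", "turn", "left"]
    print(erasure_drop(toy_score, instr, [0.1, 0.0, 0.0, 0.2, 0.0, 0.3, 0.9], k=1))
    print(erasure_drop(toy_score, instr, [0.0, 0.0, 0.9, 0.1, 0.0, 0.0, 0.05], k=1))
```

The first explanation ranks "left" highest and produces a large drop; the second ranks "the" highest and produces none. In a sequential setting one would repeat this at every navigation step and aggregate the drops.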

arXiv:2310.06474 [pdf, other]

Multilingual Jailbreak Challenges in Large Language Models

Authors: Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing

Abstract: While large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, they pose potential safety concerns, such as the "jailbreak" problem, wherein malicious instructions can manipulate LLMs into exhibiting undesirable behavior. Although several preventive measures have been developed to mitigate the potential risks associated with LLMs, they have primarily focused on English. In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potentially risky scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs using non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack LLMs. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as the availability of language resources decreases. Specifically, low-resource languages show about three times the likelihood of encountering harmful content compared with high-resource languages, with both ChatGPT and GPT-4. In the intentional scenario, multilingual prompts can exacerbate the negative impact of malicious instructions, with astonishingly high rates of unsafe output: 80.92% for ChatGPT and 40.71% for GPT-4. To handle such a challenge in the multilingual context, we propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
Experimental results show that ChatGPT fine-tuned with such data achieves a substantial reduction in unsafe content generation. Data is available at https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs.

Submitted 3 March, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: ICLR 2024

arXiv:2310.05128 [pdf, other]

Instances and Labels: Hierarchy-aware Joint Supervised Contrastive Learning for Hierarchical Multi-Label Text Classification

Authors: Simon Yu, Jie He, Víctor Gutiérrez-Basulto, Jeff Z. Pan

Abstract: Hierarchical multi-label text classification (HMTC) aims to exploit a label hierarchy in multi-label classification. Recent approaches to HMTC deal with the problem of imposing an over-constrained premise on the output space by using contrastive learning on generated samples in a semi-supervised manner to bring text and label embeddings closer. However, sample generation tends to introduce noise, as it ignores the correlation between similar samples in the same batch. One solution to this issue is supervised contrastive learning, but it remains underexplored in HMTC due to the complex structure of the labels. To overcome this challenge, we propose HJCL, a Hierarchy-aware Joint Supervised Contrastive Learning method that bridges the gap between supervised contrastive learning and HMTC. Specifically, we employ both instance-wise and label-wise contrastive learning techniques and carefully construct batches to fulfill the contrastive learning objective. Extensive experiments on four multi-path HMTC datasets demonstrate that HJCL achieves promising results and confirm the effectiveness of contrastive learning for HMTC.

Submitted 19 June, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

Comments: 18 pages; 10 figures. Published as a conference paper at EMNLP 2023 Findings (Long Paper). Code and data available at https://github.com/simonucl/HJCL
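Supervised contrastive learning differs from the self-supervised variant in that every other batch sample sharing a label counts as a positive. Below is a pure-Python sketch of the standard supervised contrastive (SupCon) loss adapted to multi-label data under one assumption that is ours rather than HJCL's: two samples are positives if their label sets overlap. HJCL's hierarchy-aware weighting, label-wise term, and batch construction are omitted.

```python
import math
from typing import List, Sequence, Set


def supcon_loss(emb: List[Sequence[float]], labels: List[Set[str]], tau: float = 0.1) -> float:
    """Supervised contrastive loss over a batch; embeddings are assumed to be L2-normalized."""
    n = len(emb)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[i] & labels[p]]
        if not positives:
            continue  # anchors without any positive in the batch are skipped
        logits = {a: sum(x * y for x, y in zip(emb[i], emb[a])) / tau for a in range(n) if a != i}
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        # Average -log p(positive | anchor) over all positives of this anchor.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

The skip for anchors without positives is one reason batch composition matters: a batch that contains no two samples with overlapping labels gives this loss nothing to learn from, which is consistent with the abstract's emphasis on carefully constructed batches.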

arXiv:2310.04747 [pdf, other]

Towards Dynamic and Small Objects Refinement for Unsupervised Domain Adaptative Nighttime Semantic Segmentation

Authors: Jingyi Pan, Sihang Li, Yucheng Chen, Jinjing Zhu, Lin Wang

Abstract: Nighttime semantic segmentation plays a crucial role in practical applications such as autonomous driving, where it frequently encounters difficulties caused by inadequate illumination and the absence of well-annotated datasets. Moreover, semantic segmentation models trained on daytime datasets often fail to generalize effectively to nighttime conditions. Unsupervised domain adaptation (UDA) has shown the potential to address these challenges and has achieved remarkable results for nighttime semantic segmentation. However, existing methods still face limitations in 1) their reliance on style transfer or relighting models, which struggle to generalize to complex nighttime environments, and 2) their neglect of dynamic and small objects like vehicles and poles, which are difficult to learn directly from other domains. This paper proposes a novel UDA method that refines both the label and feature levels for dynamic and small objects in nighttime semantic segmentation. First, we propose a dynamic and small object refinement module to complement the target nighttime domain with knowledge of dynamic and small objects from the source domain. These dynamic and small objects are normally context-inconsistent in under-exposed conditions. Then, we design a feature prototype alignment module to reduce the domain gap by applying contrastive learning between features and prototypes of the same class from different domains, while re-weighting the categories of dynamic and small objects.
Extensive experiments on three benchmark datasets demonstrate that our method outperforms prior art by a large margin for nighttime segmentation. Project page: https://rorisis.github.io/DSRNSS/.

Submitted 14 March, 2024; v1 submitted 7 October, 2023; originally announced October 2023.
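The feature prototype alignment module in the nighttime segmentation abstract is described only at a high level. One common way to realize contrastive learning between features and class prototypes is an InfoNCE-style loss: each target-domain feature is pulled toward the source-domain prototype of its (pseudo-)class and pushed away from the other prototypes, with a per-class weight that up-weights dynamic and small categories. The sketch below assumes cosine similarity, a temperature of 0.1, and a weighted mean over samples; the function names and weights are hypothetical, not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def prototype_contrastive_loss(
    features: List[Sequence[float]],         # target-domain features (hypothetical input)
    labels: List[int],                       # (pseudo-)label of each feature
    prototypes: Dict[int, Sequence[float]],  # source-domain class prototypes
    class_weights: Dict[int, float],         # e.g. larger for dynamic/small classes
    temperature: float = 0.1,                # assumed value
) -> float:
    """Class-weighted InfoNCE between features and class prototypes (sketch)."""
    classes = sorted(prototypes)
    total, weight_sum = 0.0, 0.0
    for f, y in zip(features, labels):
        logits = [cosine(f, prototypes[c]) / temperature for c in classes]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        log_prob = logits[classes.index(y)] - log_denom
        w = class_weights.get(y, 1.0)
        total += -w * log_prob
        weight_sum += w
    # Weighted mean of per-sample losses.
    return total / weight_sum if weight_sum else 0.0
```

A feature close to its own class prototype yields a loss near zero, while a feature closer to another prototype yields a large loss; raising a class weight makes errors on that class dominate the average.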

arXiv:2310.04721

Memory-Constrained Semantic Segmentation for Ultra-High Resolution UAV Imagery

Authors: Qi Li, Jiaxin Cai, Yuanlong Yu, Jason Gu, Jia Pan, Wenxi Liu

Abstract: Amid swift advancements in photography and sensor technologies, high-definition cameras have become commonplace on Unmanned Aerial Vehicles (UAVs) deployed for diverse operational purposes. Within UAV imagery analysis, the segmentation of ultra-high resolution images is a substantial and intricate challenge, especially under the constraints imposed by computational devices with restricted GPU memory. This paper addresses the problem of achieving efficient and effective segmentation of ultra-high resolution UAV imagery under stringent GPU memory limitations. Existing approaches downscale the images to achieve computationally efficient segmentation. However, this strategy tends to overlook smaller, thinner, and curvilinear regions. To address this problem, we propose a GPU memory-efficient and effective framework for local inference that does not access context beyond local patches. In particular, we introduce a novel spatial-guided high-resolution query module, which predicts high-quality pixel-wise segmentation results solely by querying the nearest latent embeddings under the guidance of high-resolution information. Additionally, we present an efficient memory-based interaction scheme that corrects potential semantic bias in the underlying high-resolution information by associating cross-image contextual semantics.
To evaluate our approach, we perform comprehensive experiments on public benchmarks and achieve superior performance under both small and large GPU memory limitations. We will release the model and code in the future.

Submitted 7 October, 2023; originally announced October 2023.
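The UAV segmentation abstract says its query module predicts pixel-wise results by "querying nearest latent embeddings". A minimal sketch of that lookup step, in the style of implicit-function decoders, maps each full-resolution pixel to the nearest cell of a coarser latent grid and returns the latent vector plus the relative offset, which a decoder (not shown) could combine with high-resolution guidance. The grid layout, the helper name `query_nearest_latent`, and the pixel-centre convention are assumptions for illustration.

```python
import math
from typing import List, Tuple

LatentGrid = List[List[List[float]]]  # indexed as [row][col][channel]


def query_nearest_latent(
    grid: LatentGrid, y: int, x: int, out_h: int, out_w: int
) -> Tuple[List[float], Tuple[float, float]]:
    """Return the latent vector nearest to output pixel (y, x) and its offset.

    Hypothetical helper: the latent grid is assumed to cover the same image
    extent as the out_h x out_w output, at a coarser resolution.
    """
    gh, gw = len(grid), len(grid[0])
    # Continuous position of the pixel centre in latent-grid coordinates.
    gy = (y + 0.5) * gh / out_h - 0.5
    gx = (x + 0.5) * gw / out_w - 0.5
    # Index of the nearest latent cell, clamped to the grid bounds.
    iy = min(max(math.floor(gy + 0.5), 0), gh - 1)
    ix = min(max(math.floor(gx + 0.5), 0), gw - 1)
    # Offset from the chosen cell centre, e.g. as extra decoder input.
    return grid[iy][ix], (gy - iy, gx - ix)


grid = [[[0.0], [1.0]], [[2.0], [3.0]]]  # 2x2 latent grid, 1 channel
print(query_nearest_latent(grid, 3, 3, 4, 4))  # → ([3.0], (0.25, 0.25))
```

Because each query touches only one latent cell, memory use stays local to the patch being decoded, which is the property the abstract emphasizes.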

arXiv:2310.04400

On the Embedding Collapse when Scaling up Recommendation Models

Authors: Xingzhuo Guo, Junwei Pan, Ximei Wang, Baixu Chen, Jie Jiang, Mingsheng Long

Abstract: Recent advances in foundation models have led to a promising trend of developing large recommendation models to leverage vast amounts of available data. Still, mainstream models remain embarrassingly small, and naïvely enlarging them does not yield sufficient performance gains, suggesting a deficiency in model scalability. In this paper, we identify the embedding collapse phenomenon as an inhibitor of scalability, wherein the embedding matrix tends to occupy a low-dimensional subspace. Through empirical and theoretical analysis, we demonstrate a \emph{two-sided effect} of feature interaction specific to recommendation models. On the one hand, interacting with collapsed embeddings restricts embedding learning and exacerbates the collapse issue. On the other hand, interaction is crucial for mitigating the fitting of spurious features, serving as a scalability guarantee. Based on our analysis, we propose a simple yet effective multi-embedding design incorporating embedding-set-specific interaction modules to learn embedding sets with large diversity and thus reduce collapse. Extensive experiments demonstrate that the proposed design provides consistent scalability and effective collapse mitigation for various recommendation models. Code is available at https://github.com/thuml/Multi-Embedding.

Submitted 6 June, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: Accepted at ICML 2024
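Since the embedding-collapse abstract characterizes collapse as the embedding matrix occupying a low-dimensional subspace, one simple way to probe it is the singular-value spectrum. The sketch below computes the sum of singular values divided by the largest one: about 1 for a nearly rank-1 (collapsed) matrix and up to the embedding dimension when variance is spread evenly. The function name and this particular ratio are offered as an illustration under that assumption; the paper should be consulted for the exact metric it uses.

```python
import numpy as np


def spectrum_spread(embeddings: np.ndarray) -> float:
    """Sum of singular values divided by the largest one.

    About 1.0 for a (nearly) rank-1, i.e. collapsed, matrix; up to
    min(rows, cols) when variance is spread evenly across directions.
    """
    s = np.linalg.svd(embeddings, compute_uv=False)  # descending order
    if s[0] == 0:
        return 0.0  # all-zero matrix: no spectrum to measure
    return float(s.sum() / s[0])


rng = np.random.default_rng(0)
spread = rng.standard_normal((1000, 16))         # full-rank embeddings
collapsed = np.outer(rng.standard_normal(1000),  # rank-1 embeddings
                     rng.standard_normal(16))
print(spectrum_spread(spread), spectrum_spread(collapsed))
```

Tracking such a ratio over training or across model sizes is one way to check whether enlarging the embedding dimension actually adds usable directions.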

arXiv:2310.04399

Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach

Authors: Junkun Chen, Jian Xue, Peidong Wang, Jing Pan, Jinyu Li

Abstract: Simultaneous speech-to-text translation plays a critical role in real-time cross-lingual communication. Despite advancements in recent years, achieving stability in the translation process remains challenging, a concern that manifests primarily as flickering of partial results. In this paper, we propose a novel revision-controllable method designed to address this issue. Our method introduces an allowed revision window within the beam search pruning process to screen out candidate translations likely to cause extensive revisions, leading to a substantial reduction in flickering and, crucially, providing the capability to eliminate flickering completely. Experiments demonstrate that the proposed method can significantly improve decoding stability without substantially compromising translation quality.

Submitted 6 October, 2023; originally announced October 2023.

Comments: accepted by ASRU 2023
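The allowed revision window from the simultaneous translation abstract can be illustrated with a small sketch: given the tokens already displayed and a set of beam candidates, count how many displayed tokens each candidate would revise (those beyond the longest common prefix) and drop candidates whose count exceeds the window. With a window of 0, every surviving candidate extends the displayed prefix unchanged, which is one way a zero-flicker setting can arise. The function names, the `(tokens, log-probability)` representation, and returning an empty list when every candidate is pruned are assumptions, not the paper's exact algorithm.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[List[str], float]  # (tokens, log-probability); assumed format


def revisions_needed(displayed: Sequence[str], candidate: Sequence[str]) -> int:
    """Number of already-displayed tokens the candidate would overwrite or drop."""
    common = 0
    for a, b in zip(displayed, candidate):
        if a != b:
            break
        common += 1
    return len(displayed) - common


def prune_by_revision_window(
    displayed: Sequence[str],
    candidates: List[Candidate],
    window: int,
    beam_size: int,
) -> List[Candidate]:
    """Keep the best-scoring candidates whose revisions fit inside the window."""
    allowed = [c for c in candidates if revisions_needed(displayed, c[0]) <= window]
    allowed.sort(key=lambda c: c[1], reverse=True)  # highest log-probability first
    return allowed[:beam_size]  # may be empty; a real decoder needs a fallback


displayed = ["the", "cat"]
beams = [(["the", "cat", "sat"], -1.0),
         (["a", "cat", "sat"], -0.5),
         (["the", "dog", "sat"], -0.7)]
print(prune_by_revision_window(displayed, beams, window=0, beam_size=2))
# → [(['the', 'cat', 'sat'], -1.0)]
```

Widening the window trades stability for quality: with `window=1` the higher-scoring `["the", "dog", "sat"]` survives, at the cost of revising one displayed token.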
