Photocathode characterisation for robust PICOSEC Micromegas precise-timing detectors

Authors: M. Lisowska, R. Aleksan, Y. Angelis, S. Aune, J. Bortfeldt, F. Brunbauer, M. Brunoldi, E. Chatzianagnostou, J. Datta, K. Dehmelt, G. Fanourakis, S. Ferry, D. Fiorina, K. J. Floethner, M. Gallinaro, F. Garcia, I. Giomataris, K. Gnanvo, F. J. Iguaz, D. Janssens, A. Kallitsopoulou, M. Kovacic, B. Kross, C. C. Lai, P. Legou, et al. (33 additional authors not shown)

Abstract: The PICOSEC Micromegas detector is a precise-timing gaseous detector based on a Cherenkov radiator coupled with a semi-transparent photocathode and a Micromegas amplifying structure, targeting a time resolution of tens of picoseconds for minimum ionising particles. Initial single-pad prototypes have demonstrated a time resolution below 25 ps, prompting ongoing developments to adapt the concept for applications. The achieved performance is being transferred to robust multi-channel detector modules suitable for large-area detection systems requiring excellent timing precision. To enhance the robustness and stability of the PICOSEC Micromegas detector, research on robust carbon-based photocathodes, including Diamond-Like Carbon (DLC) and Boron Carbide (B4C), is pursued. Results from prototypes equipped with DLC and B4C photocathodes exhibited a time resolution of approximately 32 ps and 34.5 ps, respectively. Efforts dedicated to improving detector robustness and stability enhance the feasibility of the PICOSEC Micromegas concept for large experiments, ensuring sustained performance while maintaining excellent timing precision.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.07329 [pdf, other]

Probability of Differentiation Reveals Brittleness of Homogeneity Bias in Large Language Models

Authors: Messi H. J. Lee, Calvin K. Lai

Abstract: Homogeneity bias in Large Language Models (LLMs) refers to their tendency to homogenize the representations of some groups compared to others. Previous studies documenting this bias have predominantly used encoder models, which may have inadvertently introduced biases. To address this limitation, we prompted GPT-4 to generate single word/expression completions associated with 18 situation cues (specific, measurable elements of environments that influence how individuals perceive situations) and compared the variability of these completions using probability of differentiation. This approach directly assessed homogeneity bias from the model's outputs, bypassing encoder models. Across five studies, we find that homogeneity bias is highly volatile across situation cues and writing prompts, suggesting that the bias observed in past work may reflect those within encoder models rather than LLMs. Furthermore, these results suggest that homogeneity bias in LLMs is brittle, as even minor and arbitrary changes in prompts can significantly alter the expression of biases. Future work should further explore how variations in syntactic features and topic choices in longer text generations influence homogeneity bias in LLMs.

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.07001 [pdf, ps, other]

Applications of the Green tensor estimates of the nonstationary Stokes system in the half space

Authors: Kyungkeun Kang, Baishun Lai, Chen-Chih Lai, Tai-Peng Tsai

Abstract: In this paper, we present a series of applications of the pointwise estimates of the (unrestricted) Green tensor of the nonstationary Stokes system in the half space, established in our previous work [CMP 2023]. First, we show the $L^1$-$L^q$ estimates for the Stokes flow with possibly non-solenoidal $L^1$ initial data, generalizing the results of Giga-Matsui-Shimizu [Math. Z. 1999] and Desch-Hieber-Prüss [J. Evol. Equ. 2001]. Second, we construct mild solutions of the Navier-Stokes equations in the half space with mixed-type pointwise decay or with pointwise decay alongside boundary vanishing. Finally, we explore various coupled fluid systems in the half space including viscous resistive magnetohydrodynamics equations, a coupled system for the flow and the magnetic field of MHD type, and the nematic liquid crystal flow. For each of these systems, we construct mild solutions in $L^q$, pointwise decay, and uniformly local $L^q$ spaces.

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.06194 [pdf, other]

More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models

Authors: Messi H. J. Lee, Jacob M. Montgomery, Calvin K. Lai

Abstract: Vision Language Models (VLMs), exemplified by GPT-4V, adeptly integrate text and vision modalities. This integration enhances Large Language Models' ability to mimic human perception, allowing them to process image inputs. Despite VLMs' advanced capabilities, however, there is a concern that VLMs inherit biases of both modalities in ways that make biases more pervasive and difficult to mitigate. Our study explores how VLMs perpetuate homogeneity bias and trait associations with regard to race and gender. When prompted to write stories based on images of human faces, GPT-4V describes subordinate racial and gender groups with greater homogeneity than dominant groups and relies on distinct, yet generally positive, stereotypes. Importantly, VLM stereotyping is driven by visual cues rather than group membership alone, such that faces that are rated as more prototypically Black and feminine are subject to greater stereotyping. These findings suggest that VLMs may associate subtle visual cues related to racial and gender groups with stereotypes in ways that could be challenging to mitigate. We explore the underlying reasons behind this behavior, discuss its implications, and emphasize the importance of addressing these biases as VLMs come to mirror human perception.

Submitted 21 May, 2024; originally announced July 2024.

arXiv:2407.01432 [pdf, ps, other]

$\mathcal{PT}$-Symmetry induced Bi-Stability in Non-Hermitian Cavity Magnomechanics

Authors: Chaoyi Lai, Shah Fahad, Kashif Ammar Yasir

Abstract: We study a steady-state non-Hermitian magnomechanical system driven by a transverse magnetic field that directly interacts with a YIG sphere and excites cavity magnons and photons. To make the system non-Hermitian, we use a traveling field directly interacting with magnons, generating gain in the system. We start by illustrating the PT configuration of the system, which contains two PT-broken regions around the exceptional point and a PT-protected region along the axis of the exceptional point. Later, we discover that the numbers of cavity photons and magnons show bistable behavior depending upon the PT configuration, which becomes more significant as the values of the magnon-photon coupling and traveling field strength increase. We illustrate that the steady-state photon number shows bistable behavior only when the system is in the lossy PT-broken configuration, meaning that the strength of the traveling field is less than the magnon-photon coupling. Otherwise, it contains just a single stable state because gain in the system suppresses bistability, unlike in other investigations in this direction. Further, a larger magnon-photon coupling increases photon intensity and decreases magnon intensity, because of photon and magnon energy exchange, leading to enhanced photon bistability and decreased magnon bistability. However, as the strength of the traveling field increases, both photon and magnon bistability appear to decrease. We also study the steady-state effective potential of the system and illustrate the occurrence of bistability with nonlinear interactions between contour trajectories, which similarly depends on the PT-broken configuration of the system.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 10 pages, 6 figures

arXiv:2407.00607 [pdf, other]

Reducing Quantum Error Correction Overhead with Versatile Flag-Sharing Syndrome Extraction Circuits

Authors: Pei-Hao Liou, Ching-Yi Lai

Abstract: Given that quantum error correction processes are unreliable, an efficient error syndrome extraction circuit should use fewer ancillary qubits, quantum gates, and measurements, while maintaining low circuit depth, to minimize the circuit area, roughly defined as the product of circuit depth and the number of physical qubits. We propose to design parallel flagged syndrome extraction with shared flag qubits for quantum stabilizer codes. Versatile parallelization techniques are employed to minimize the required circuit area, thereby improving the error threshold and overall performance. Specifically, all the measurement outcomes in multiple rounds of syndrome extraction are integrated into a lookup table decoder, allowing us to parallelize multiple stabilizer measurements with shared flag qubits. We present flag-sharing and fully parallel schemes for the [[17,1,5]] and [[19,1,5]] Calderbank-Shor-Steane (CSS) codes. This methodology extends to the [[5,1,3]] non-CSS code, achieving the minimum known circuit area. Numerical simulations have demonstrated improved pseudothresholds for these codes by up to an order of magnitude compared to previous schemes in the literature.

Submitted 30 June, 2024; originally announced July 2024.

Comments: 19 pages, 22 figures

arXiv:2407.00450 [pdf, other]

Hybrid Quantum-Classical Clustering for Preparing a Prior Distribution of Eigenspectrum

Authors: Mengzhen Ren, Yu-Cheng Chen, Ching-Jui Lai, Min-Hsiu Hsieh, Alice Hu

Abstract: Determining the energy gap in a quantum many-body system is critical to understanding its behavior and is important in quantum chemistry and condensed matter physics. The challenge of determining the energy gap requires identifying both the excited and ground states of a system. In this work, we consider preparing the prior distribution and circuits for the eigenspectrum of time-independent Hamiltonians, which can benefit both classical and quantum algorithms for solving eigenvalue problems. The proposed algorithm unfolds in three strategic steps: Hamiltonian transformation, parameter representation, and classical clustering. These steps are underpinned by two key insights: the use of quantum circuits to approximate the ground state of transformed Hamiltonians and the analysis of parameter representation to distinguish between eigenvectors. The algorithm is showcased through applications to the 1D Heisenberg system and the LiH molecular system, highlighting its potential for both near-term quantum devices and fault-tolerant quantum devices. The paper also explores the scalability of the method and its performance across various settings, setting the stage for more resource-efficient quantum computations that are both accurate and fast. The findings presented here mark a new insight into hybrid algorithms, offering a pathway to overcoming current computational challenges.

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.18556 [pdf]

Renal digital pathology visual knowledge search platform based on language large model and book knowledge

Authors: Xiaomin Lv, Chong Lai, Liya Ding, Maode Lai, Qingrong Sun

Abstract: Large models have become mainstream, yet their applications in digital pathology still require exploration. Meanwhile, renal pathology images play an important role in the diagnosis of renal diseases. We conducted image segmentation and paired corresponding text descriptions based on 60 books on renal pathology, performed clustering analysis of all image and text description features based on large models, and ultimately built a retrieval system based on the semantic features of large models. Based on the above analysis, we established a knowledge base of 10,317 renal pathology images and paired corresponding text descriptions, and then we evaluated the semantic feature capabilities of 4 large models, including GPT2, gemma, LLma and Qwen, and the image-based feature capabilities of the dinov2 large model. Furthermore, we built a semantic retrieval system, named RppD (aidp.zjsru.edu.cn), to retrieve pathological images based on text descriptions.

Submitted 26 May, 2024; originally announced June 2024.

Comments: 9 pages, 6 figures

arXiv:2406.09270 [pdf, other]

Discovery and Extensive Follow-Up of SN 2024ggi, a nearby type IIP supernova in NGC 3621

Authors: Ting-Wan Chen, Sheng Yang, Shubham Srivastav, Takashi J. Moriya, Stephen J. Smartt, Sofia Rest, Armin Rest, Hsing Wen Lin, Hao-Yu Miao, Yu-Chi Cheng, Amar Aryan, Chia-Yu Cheng, Morgan Fraser, Li-Ching Huang, Meng-Han Lee, Cheng-Han Lai, Yu Hsuan Liu, Aiswarya Sankar. K, Ken W. Smith, Heloise F. Stevance, Ze-Ning Wang, Joseph P. Anderson, Charlotte R. Angus, Thomas de Boer, Kenneth Chambers, et al. (23 additional authors not shown)

Abstract: We present the discovery and early observations of the nearby Type II supernova (SN) 2024ggi in NGC 3621 at 6.64 +/- 0.3 Mpc. The SN was caught 5.8 (+1.9 -2.9) hours after its explosion by the ATLAS survey. Early-phase, high-cadence, and multi-band photometric follow-up was performed by the Kinder (Kilonova Finder) project, collecting over 1000 photometric data points within a week. The combined o- and r-band light curves show a rapid rise of 3.3 magnitudes in 13.7 hours, much faster than SN 2023ixf (another recent, nearby, and well-observed SN II). Between 13.8 and 18.8 hours after explosion SN 2024ggi became bluer, with u-g colour dropping from 0.53 to 0.15 mag. The rapid blueward evolution indicates a wind shock breakout (SBO) scenario. No hour-long brightening expected for the SBO from a bare stellar surface was detected during our observations. The classification spectrum, taken 17 hours after the SN explosion, shows flash features of high-ionization species such as Balmer lines, He I, C III, and N III. Detailed light curve modeling reveals critical insights into the properties of the circumstellar material (CSM). Our favoured model has an explosion energy of 2 x 10^51 erg, a mass-loss rate of 10^-3 solar_mass/yr (with an assumed 10 km/s wind), and a confined CSM radius of 6 x 10^14 cm. The corresponding CSM mass is 0.4 solar_mass. Comparisons with SN 2023ixf highlight that SN 2024ggi has a smaller CSM density, resulting in a faster rise and fainter UV flux. The extensive dataset and the involvement of citizen astronomers underscore that a collaborative network is essential for SBO searches, leading to more precise and comprehensive SN characterizations.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 11 pages, 5 figures in manuscript, 6 pages in appendix, submitted to ApJL

arXiv:2406.08353 [pdf, other]

Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

Authors: Yuanchao Li, Peter Bell, Catherine Lai

Abstract: Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SER performance using ASR transcripts with varying Word Error Rates (WERs) on well-known corpora: IEMOCAP, CMU-MOSI, and MSP-Podcast. Our evaluation includes text-only and bimodal SER with diverse fusion techniques, aiming for a comprehensive analysis that uncovers novel findings and challenges faced by current SER research. Additionally, we propose a unified ASR error-robust framework integrating ASR error correction and modality-gated fusion, achieving lower WER and higher SER results compared to the best-performing ASR transcript. This research is expected to provide insights into SER with ASR assistance, especially for real-world applications.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.08102 [pdf, other]

Adversarial Patch for 3D Local Feature Extractor

Authors: Yu Wen Pao, Li Chang Lai, Hong-Yi Lin

Abstract: Local feature extractors are the cornerstone of many computer vision tasks. However, their vulnerability to adversarial attacks can significantly compromise their effectiveness. This paper discusses approaches to attack sophisticated local feature extraction algorithms and models to achieve two distinct goals: (1) forcing a match between originally non-matching image regions, and (2) preventing a match between originally matching regions. At the end of the paper, we discuss the performance and drawbacks of different patch generation methods.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.04553 [pdf, other]

Better Late Than Never: Formulating and Benchmarking Recommendation Editing

Authors: Chengyu Lai, Sheng Zhou, Zhimeng Jiang, Qiaoyu Tan, Yuanchen Bei, Jiawei Chen, Ningyu Zhang, Jiajun Bu

Abstract: Recommendation systems play a pivotal role in suggesting items to users based on their preferences. However, in online platforms, these systems inevitably offer unsuitable recommendations due to limited model capacity, poor data quality, or evolving user interests. Enhancing user experience necessitates efficiently rectifying such unsuitable recommendation behaviors. This paper introduces a novel and significant task termed recommendation editing, which focuses on modifying known and unsuitable recommendation behaviors. Specifically, this task aims to adjust the recommendation model to eliminate known unsuitable items without accessing training data or retraining the model. We formally define the problem of recommendation editing with three primary objectives: strict rectification, collaborative rectification, and concentrated rectification. Three evaluation metrics are developed to quantitatively assess the achievement of each objective. We present a straightforward yet effective benchmark for recommendation editing using a novel Editing Bayesian Personalized Ranking Loss. To demonstrate the effectiveness of the proposed method, we establish a comprehensive benchmark that incorporates various methods from related fields. Codebase is available at https://github.com/cycl2018/Recommendation-Editing.

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2406.00317 [pdf, other]

Combining Experimental and Historical Data for Policy Evaluation

Authors: Ting Li, Chengchun Shi, Qianglin Wen, Yang Sui, Yongli Qin, Chunbo Lai, Hongtu Zhu

Abstract: This paper studies policy evaluation with multiple data sources, especially in scenarios that involve one experimental dataset with two arms, complemented by a historical dataset generated under a single control arm. We propose novel data integration methods that linearly integrate base policy value estimators constructed based on the experimental and historical data, with weights optimized to minimize the mean square error (MSE) of the resulting combined estimator. We further apply the pessimistic principle to obtain more robust estimators, and extend these developments to sequential decision making. Theoretically, we establish non-asymptotic error bounds for the MSEs of our proposed estimators, and derive their oracle, efficiency and robustness properties across a broad spectrum of reward shift scenarios. Numerical experiments and real-data-based analyses from a ridesharing company demonstrate the superior performance of the proposed estimators.

Submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.20064 [pdf, other]

1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem

Authors: Mingjie Chen, Hezhao Zhang, Yuanchao Li, Jiachen Luo, Wen Wu, Ziyang Ma, Peter Bell, Catherine Lai, Joshua Reiss, Lin Wang, Philip C. Woodland, Xie Chen, Huy Phan, Thomas Hain

Abstract: Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for training, but problems remain as it sometimes causes over-fitting for minor classes or under-fitting for major classes. This paper presents the system developed by a multi-site team for the participation in the Odyssey 2024 Emotion Recognition Challenge Track-1. The challenge data has the aforementioned properties and therefore the presented systems aimed to tackle these issues, by introducing focal loss in optimisation when applying class weighted loss. Specifically, the focal loss is further weighted by prior-based class weights. Experimental results show that combining these two approaches brings better overall performance, by sacrificing performance on major classes. The system further employs a majority voting strategy to combine the outputs of an ensemble of 7 models. The models are trained independently, using different acoustic features and loss functions - with the aim to have different properties for different data. Hence these models show different performance preferences on major classes and minor classes. The ensemble system output obtained the best performance in the challenge, ranking top-1 among 68 submissions. It also outperformed all single models in our set. On the Odyssey 2024 Emotion Recognition Challenge Task-1 data the system obtained a Macro-F1 score of 35.69% and an accuracy of 37.32%.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.18503 [pdf, other]

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

Authors: Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

Abstract: Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitioning between high-quality 1-step sound generation and superior sound quality through multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM's training framework and introduce a novel feature distance by utilizing the teacher's network for a distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM's capability of controllable sound generation in a training-free manner. Our codes, pretrained models, and audio samples are available at https://github.com/sony/soundctm.

Submitted 10 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: Audio samples: https://koichi-saito-sony.github.io/soundctm/. Codes: https://github.com/sony/soundctm. Checkpoints: https://huggingface.co/Sony/soundctm

arXiv:2405.17768 [pdf, other]

Revisiting the Message Passing in Heterophilous Graph Neural Networks

Authors: Zhuonan Zheng, Yuanchen Bei, Sheng Zhou, Yao Ma, Ming Gu, HongJia XU, Chengyu Lai, Jiawei Chen, Jiajun Bu

Abstract: Graph Neural Networks (GNNs) have demonstrated strong performance in graph mining tasks due to their message-passing mechanism, which is aligned with the homophily assumption that adjacent nodes exhibit similar behaviors. However, in many real-world graphs, connected nodes may display contrasting behaviors, termed heterophilous patterns, which has attracted increased interest in heterophilous GNNs (HTGNNs). Although the message-passing mechanism seems unsuitable for heterophilous graphs due to the propagation of class-irrelevant information, it is still widely used in many existing HTGNNs and consistently achieves notable success. This raises the question: why does message passing remain effective on heterophilous graphs? To answer this question, in this paper, we revisit the message-passing mechanisms in heterophilous graph neural networks and reformulate them into a unified heterophilous message-passing (HTMP) mechanism. Based on HTMP and empirical analysis, we reveal that the success of message passing in existing HTGNNs is attributed to implicitly enhancing the compatibility matrix among classes. Moreover, we argue that the full potential of the compatibility matrix is not completely achieved due to the existence of incomplete and noisy semantic neighborhoods in real-world heterophilous graphs. To bridge this gap, we introduce a new approach named CMGNN, which operates within the HTMP mechanism to explicitly leverage and improve the compatibility matrix. A thorough evaluation involving 10 benchmark datasets and comparative analysis against 13 well-established baselines highlights the superior performance of the HTMP mechanism and CMGNN method.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.17251 [pdf, other]

GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

Authors: Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, Yuki Mitsufuji

Abstract: Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models. However, they struggle with noisy depth maps and loss of semantic details when warping an input view to novel viewpoints. In this paper, we propose a novel approach for single-shot novel view synthesis, a semantic-preserving generative warping framework that enables T2I generative models to learn where to warp and where to generate, through augmenting cross-view attention with self-attention. Our approach addresses the limitations of existing methods by conditioning the generative model on source view images and incorporating geometric warping signals. Qualitative and quantitative evaluations demonstrate that our model outperforms existing methods in both in-domain and out-of-domain scenarios. Project page is available at https://GenWarp-NVS.github.io/.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Project page: https://GenWarp-NVS.github.io

arXiv:2405.16677 [pdf, other]

Crossmodal ASR Error Correction with Discrete Speech Units

Authors: Yuanchao Li, Pinzhen Chen, Peter Bell, Catherine Lai

Abstract: ASR remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue: the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with 1-best hypothesis transcription. We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, shedding light on appropriate training schemes for LROOD data. Moreover, we propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality. Results from multiple corpora and several evaluation metrics demonstrate the feasibility and efficacy of our proposed AEC approach on LROOD data, as well as its generalizability and superiority on large-scale data. Finally, a study on speech emotion recognition confirms that our model produces ASR error-robust transcripts suitable for downstream applications.

Submitted 26 May, 2024; originally announced May 2024.

arXiv:2405.16194 [pdf, other]

Diffusion-Reward Adversarial Imitation Learning

Authors: Chun-Mao Lai, Hsiang-Chun Wang, Ping-Chun Hsieh, Yu-Chiang Frank Wang, Min-Hung Chen, Shao-Hua Sun

Abstract: Imitation learning aims to learn a policy from observing expert demonstrations without access to reward signals from environments. Generative adversarial imitation learning (GAIL) formulates imitation learning as adversarial learning, employing a generator policy learning to imitate expert behaviors and discriminator learning to distinguish the expert demonstrations from agent trajectories. Despite its encouraging results, GAIL training is often brittle and unstable. Inspired by the recent dominance of diffusion models in generative modeling, this work proposes Diffusion-Reward Adversarial Imitation Learning (DRAIL), which integrates a diffusion model into GAIL, aiming to yield more precise and smoother rewards for policy learning. Specifically, we propose a diffusion discriminative classifier to construct an enhanced discriminator; then, we design diffusion rewards based on the classifier's output for policy learning. We conduct extensive experiments in navigation, manipulation, and locomotion, verifying DRAIL's effectiveness compared to prior imitation learning methods. Moreover, additional experimental results demonstrate the generalizability and data efficiency of DRAIL. Visualized learned reward functions of GAIL and DRAIL suggest that DRAIL can produce more precise and smoother rewards.

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.14822 [pdf, other]

PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher

Authors: Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

Abstract: To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data to a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. From the nature of progressively growing decoder, PaGoDA avoids re-training teacher/student models when we upsample the student model, making the whole training pipeline much cheaper. In experiments, we used our progressively growing decoder to upsample from the pre-trained model's 64x64 resolution to generate 512x512 samples, achieving 2x faster inference compared to single-step distilled Stable Diffusion like LCM. PaGoDA also achieved state-of-the-art FIDs on ImageNet across all resolutions from 64x64 to 512x512. Additionally, we demonstrated PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2404.19228 [pdf, other]

Understanding Multimodal Contrastive Learning Through Pointwise Mutual Information

Authors: Toshimitsu Uesaka, Taiji Suzuki, Yuhta Takida, Chieh-Hsin Lai, Naoki Murata, Yuki Mitsufuji

Abstract: Multimodal representation learning to integrate different modalities, such as text, vision, and audio is important for real-world applications. The symmetric InfoNCE loss proposed in CLIP is a key concept in multimodal representation learning. In this work, we provide a theoretical understanding of the symmetric InfoNCE loss through the lens of the pointwise mutual information and show that encoders that achieve the optimal similarity in the pretraining provide a good representation for downstream classification tasks under mild assumptions. Based on our theoretical results, we also propose a new similarity metric for multimodal contrastive learning by utilizing a nonlinear kernel to enrich the capability. To verify the effectiveness of the proposed method, we demonstrate pretraining of multimodal representation models on the Conceptual Caption datasets and evaluate zero-shot classification and linear classification on common benchmark datasets.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.18586

How to surpass no-go limits in Gaussian quantum error correction and entangled Gaussian state distillation?

Authors: En-Jui Chang, Ching-Yi Lai

Abstract: Gaussian quantum information processing with continuous-variable (CV) quantum information carriers holds significant promise for applications in quantum communication and quantum internet. However, applying Gaussian state distillation and quantum error correction (QEC) faces limitations imposed by no-go results concerning local Gaussian unitary operations and classical communications. This paper introduces a Gaussian QEC protocol that relies solely on local Gaussian resources. A pivotal component of our approach is CV gate teleportation using entangled Gaussian states, which facilitates the implementation of the partial transpose operation on a quantum channel. Consequently, we can efficiently construct a two-mode noise-polarized channel from two noisy Gaussian channels. Furthermore, this QEC protocol naturally extends to a nonlocal Gaussian state distillation protocol.

Submitted 7 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

Comments: Lemma 3 and Lemma 4 are incorrect

arXiv:2404.13227 [pdf, other]

Machine learning for climate physics and simulations

Authors: Ching-Yao Lai, Pedram Hassanzadeh, Aditi Sheshadri, Maike Sonnewald, Raffaele Ferrari, Venkatramani Balaji

Abstract: We discuss the emerging advances and opportunities at the intersection of machine learning (ML) and climate physics, highlighting the use of ML techniques, including supervised, unsupervised, and equation discovery, to accelerate climate knowledge discoveries and simulations. We delineate two distinct yet complementary aspects: (1) ML for climate physics and (2) ML for climate simulations. While physics-free ML-based models, such as ML-based weather forecasting, have demonstrated success when data is abundant and stationary, the physics knowledge and interpretability of ML models become crucial in the small-data/non-stationary regime to ensure generalizability. Given the absence of observations, the long-term future climate falls into the small-data regime. Therefore, ML for climate physics holds a critical role in addressing the challenges of ML for climate simulations. We emphasize the need for collaboration among climate physics, ML theory, and numerical analysis to achieve reliable ML-based models for climate applications.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.09385 [pdf, other]

A Large-Scale Evaluation of Speech Foundation Models

Authors: Shu-wen Yang, Heng-Jui Chang, Zili Huang, Andy T. Liu, Cheng-I Lai, Haibin Wu, Jiatong Shi, Xuankai Chang, Hsiang-Sheng Tsai, Wen-Chin Huang, Tzu-hsun Feng, Po-Han Chi, Yist Y. Lin, Yung-Sung Chuang, Tzu-Hsien Huang, Wei-Cheng Tseng, Kushal Lakhotia, Shang-Wen Li, Abdelrahman Mohamed, Shinji Watanabe, Hung-yi Lee

Abstract: The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol and the statistical significance and robustness of the benchmark.

Submitted 29 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: The extended journal version for SUPERB and SUPERB-SG. Published in IEEE/ACM TASLP. The Arxiv version is preferred

arXiv:2404.03268 [pdf, other]

Efficient Ground State Estimation Using Generalized Hund's Rule

Authors: Leo Chiang, Ching-Jui Lai

Abstract: Quantum computers offer a promising approach to simulate the ground state of molecules, which is crucial for understanding molecular properties and chemical reactions. However, the limited number of available qubits on current devices poses a challenge for simulation. This paper investigates the feasibility of reducing the qubit usage of molecular simulation by examining specific fermionic states according to Hund's rule. We introduce a new framework based on qubit efficiency encoding. Based on this framework, the Hamiltonian is restricted to the Hund subspace. Compared to considering only particle conservation, the proposed method can reduce $N$ qubit usage for an $M$ orbitals and $N$ electrons molecule when $M\gg N$. Additionally, when using the STO-3G basis sets, the simulations of the $15$ molecules with given molecular geometry by the proposed method are close to the full configuration interaction. The absolute difference is at most $0.121\%$. Meanwhile, predictions from potential energy surfaces using the proposed method have an absolute difference of at most $4.1\%$.

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2404.02846 [pdf, ps, other]

On the Springer correspondence for wreath products

Authors: You-Hung Hsu, Chun-Ju Lai

Abstract: We first show that the wreath product $Σ_m\wr Σ_d$ between two symmetric groups appears as the generalized Weyl group of an Iwahori's generalized Tits system. We then introduce a certain subvariety of the flag variety of type A, and then give a geometric proof of its Bruhat decomposition indexed by $Σ_m\wr Σ_d$, via the Bialynicki-Birula decomposition. Furthermore, we realize the group algebra $\mathbb{Q}[Σ_m\wr Σ_d]$ as the top Borel-Moore homology of a Steinberg variety. Such a geometric realization leads to a Springer correspondence for the irreducible representations over $\mathbb{C}[Σ_m\wr Σ_d]$, which can be regarded as a counterpart of the Clifford theory for wreath products. Consequently, we have obtained a new Springer correspondence of type B/C/D using essentially type A geometry.

Submitted 18 April, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

Comments: 17 pages. v2: exposition improved

arXiv:2403.11211 [pdf]

RCdpia: A Renal Carcinoma Digital Pathology Image Annotation dataset based on pathologists

Authors: Qingrong Sun, Weixiang Zhong, Jie Zhou, Chong Lai, Xiaodong Teng, Maode Lai

Abstract: The annotation of digital pathological slide data for renal cell carcinoma is of paramount importance for correct diagnosis of artificial intelligence models due to the heterogeneous nature of the tumor. This process not only facilitates a deeper understanding of renal cell cancer heterogeneity but also aims to minimize noise in the data for more accurate studies. To enhance the applicability of the data, two pathologists were enlisted to meticulously curate, screen, and label a kidney cancer pathology image dataset from The Cancer Genome Atlas Program (TCGA) database. Subsequently, a Resnet model was developed to validate the annotated dataset against an additional dataset from the First Affiliated Hospital of Zhejiang University. Based on these results, we have meticulously compiled the TCGA digital pathological dataset with independent labeling of tumor regions and adjacent areas (RCdpia), which includes 109 cases of kidney chromophobe cell carcinoma, 486 cases of kidney clear cell carcinoma, and 292 cases of kidney papillary cell carcinoma. This dataset is now publicly accessible at http://39.171.241.18:8888/RCdpia/. Furthermore, model analysis has revealed significant discrepancies in predictive outcomes when applying the same model to datasets from different centers. Leveraging the RCdpia, we can now develop more precise digital pathology artificial intelligence models for tasks such as normalization, classification, and segmentation. These advancements underscore the potential for more nuanced and accurate AI applications in the field of digital pathology.

Submitted 17 March, 2024; originally announced March 2024.

Comments: 8 pages, 3 figures, 1 table

arXiv:2402.19383 [pdf, other]

Harnessing Coding Theory for Reliable Network Quantum Communication

Authors: Ching-Yi Lai, Kao-Yueh Kuo

Abstract: This article explores the application of coding techniques for fault-tolerant quantum computation and extends their usage to fault-tolerant quantum communication. We review repeater-based quantum networks, emphasizing the roles of coding theory and fault-tolerant quantum operations, particularly in the context of quantum teleportation. We highlight that fault-tolerant implementation of the Bell measurement enables reliable quantum communication without requiring a universal set of quantum gates. Finally, we discuss various quantum code candidates for achieving higher transmission rates.

Submitted 29 February, 2024; originally announced February 2024.

Comments: 7 pages, 5 figures

arXiv:2402.15630 [pdf]

In-beam test results of an RPC-based module for position-sensitive neutron detectors with timing readout

Authors: G. Canezin, L. M. S. Margato, A. Morozov, A. Blanco, J. Saraiva, L. Lopes, P. Fonte, Chung Chuan Lai, Per-Olof Svensson, G. Markaj, Florian M. Piegsa

Abstract: Recently we have proposed a new concept of a thermal neutron detector based on resistive plate chambers and 10B4C solid neutron converters, enabling to readout with high resolution in both the 3D position of neutron capture and the neutron time of flight (ToF). In this paper, we report the results of the first beam tests conducted with a new neutron RPC detection module, coupled to the position readout units of a new design. The main focus is on the measurements of the neutron ToF and identification of the converter layer where the neutron is captured, giving the position along the beam direction.

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.14905 [pdf, other]

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Authors: Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra

Abstract: This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight-sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% over MobileLLM 125M/350M. Moreover, the MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.

Submitted 26 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: ICML 2024. Code is available at https://github.com/facebookresearch/MobileLLM

arXiv:2402.14677 [pdf, other]

Influence of thermal effects on atomic Bloch oscillation

Authors: Guoling Yin, Chi-Kin Lai, Nana Chang, Yi Liang, Dekai Mao, Xiaoji Zhou

Abstract: Advancements in the experimental toolbox of cold atoms have enabled the meticulous control of atomic Bloch oscillation within optical lattices, thereby enhancing the capabilities of gravity interferometers. This work delves into the impact of thermal effects on Bloch oscillation in 1D accelerated optical lattices aligned with gravity by varying the system's initial temperature. Through the application of Raman cooling, we effectively reduce the longitudinal thermal effect, stabilizing the longitudinal coherence length over the timescale of its lifetime. The atomic losses over multiple Bloch oscillations are measured and are primarily attributed to transverse excitation. Furthermore, we identify two distinct inverse scaling behaviors in the oscillation lifetime scaled by the corresponding density with respect to temperatures, implying diverse equilibrium processes within or outside the Bose-Einstein condensate regime. The competition between the system's coherence and atomic density leads to a relatively smooth variation in the actual lifetime versus temperature. Our findings provide valuable insights into the interaction between thermal effects and Bloch oscillation, offering avenues for the refinement of quantum measurement technologies.

Submitted 22 February, 2024; originally announced February 2024.

Comments: 8 pages, 7 figures

arXiv:2402.08643 [pdf, other]

Learned Image Compression with Text Quality Enhancement

Authors: Chih-Yu Lai, Dung Tran, Kazuhito Koishida

Abstract: Learned image compression has gained widespread popularity for its efficiency in achieving ultra-low bit-rates. Yet, images containing substantial textual content, particularly screen-content images (SCI), often suffer from text distortion at such compressed levels. To address this, we propose to minimize a novel text logit loss designed to quantify the disparity in text between the original and reconstructed images, thereby improving the perceptual quality of the reconstructed text. Through rigorous experimentation across diverse datasets and employing state-of-the-art algorithms, our findings reveal significant enhancements in the quality of reconstructed text upon integration of the proposed loss function with appropriate weighting. Notably, we achieve a Bjontegaard delta (BD) rate of -32.64% for Character Error Rate (CER) and -28.03% for Word Error Rate (WER) on average by applying the text logit loss for two screenshot datasets. Additionally, we present quantitative metrics tailored for evaluating text quality in image compression tasks. Our findings underscore the efficacy and potential applicability of our proposed text logit loss function across various text-aware image compression contexts.

Submitted 13 February, 2024; originally announced February 2024.

Comments: Submitted to ICIP 2024

arXiv:2402.08325 [pdf, other]

doi 10.1088/1748-0221/19/05/P05010

Multi-Blade detector with VMM3a-ASIC-based readout: installation and commissioning at the reflectometer Amor at PSI

Authors: F. Piscitelli, F. Ghazi Moradi, F. S. Alves, M. J. Christensen, J. Hrivnak, A. Johansson, K. Fissum, C. C. Lai, A. Monera Martinez, D. Pfeiffer, E. Shahu, J. Stahn, P. O. Svensson

Abstract: The Multi-Blade (MB) Boron-10-based neutron detector is the chosen technology for three instruments at the European Spallation Source (ESS): the two ESS reflectometers, ESTIA and FREIA, and the Test Beam Line. A fourth MB detector has been built, installed and commissioned for the user operation of the reflectometer Amor at PSI (Switzerland). Amor can be considered a downscaled version of the ESS reflectometer ESTIA. They are based on the same Selene guide concept, optimized for performing focusing reflectometry on small samples. The experience gained at Amor is invaluable for the future deployment of the MB detector at the ESS. This manuscript describes the MB detector construction and installation at Amor along with the readout electronics chain based on the VMM3a ASIC. The readout chain deployed at Amor is equivalent to that of the ESS, including the readout master module (RMM), event-formation-units (EFUs), Kafka, FileWriter and live visualisation tools.

Submitted 18 March, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: 16 pages, 12 figures

Journal ref: 2024 JINST 19 P05010

arXiv:2402.02617 [pdf, other]

Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition

Authors: Alexandra Saliba, Yuanchao Li, Ramon Sanabria, Catherine Lai

Abstract: The efficacy of self-supervised speech models has been validated, yet the optimal utilization of their representations remains challenging across diverse tasks. In this study, we delve into Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks. AWEs have previously shown utility in capturing acoustic discriminability. In light of this, we propose measuring layer-wise similarity between AWEs and word embeddings, aiming to further investigate the inherent context within AWEs. Moreover, we evaluate the contribution of AWEs, in comparison to other types of speech features, in the context of Speech Emotion Recognition (SER). Through a comparative experiment and a layer-wise accuracy analysis on two distinct corpora, IEMOCAP and ESD, we explore differences between AWEs and raw self-supervised representations, as well as the proper utilization of AWEs alone and in combination with word embeddings. Our findings underscore the acoustic context conveyed by AWEs and showcase the highly competitive SER accuracies by appropriately employing AWEs.

Submitted 4 February, 2024; originally announced February 2024.

Comments: Accepted to ICASSP2024 Self-supervision in Audio, Speech and Beyond (SASB) workshop. First two authors contributed equally

arXiv:2401.10711 [pdf, other]

Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering

Authors: Haibo Wang, Chenghang Lai, Yixuan Sun, Weifeng Ge

Abstract: Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding and reasoning, they deal with VideoQA insufficiently, by simply taking uniformly sampled frames as visual inputs, which ignores question-relevant visual clues. Moreover, there are no human annotations for question-critical timestamps in existing VideoQA datasets. In light of this, we propose a novel weakly supervised framework to enforce the LMMs to reason out the answers with question-critical moments as visual inputs. Specifically, we first fuse the question and answer pairs as event descriptions to find multiple keyframes as target moments and pseudo-labels, with the visual-language alignment capability of the CLIP models. With these pseudo-labeled keyframes as additional weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian functions to characterize the temporal structure of the video, and samples question-critical frames as positive moments to be the visual inputs of LMMs. Extensive experiments on several benchmarks verify the effectiveness of our framework, and we achieve substantial improvements compared to previous state-of-the-art methods.

Submitted 26 April, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

arXiv:2401.09695 [pdf]

Should ChatGPT Write Your Breakup Text? Exploring the Role of AI in Relationship Dissolution

Authors: Yue Fu, Yixin Chen, Zelia Gomes Da Costa Lai, Alexis Hiniker

Abstract: Relationships are essential to our happiness and wellbeing. The dissolution of a relationship, the final stage of a relationship's lifecycle and one of the most stressful events in an individual's life, can have profound and long-lasting impacts on people. With the breakup process increasingly facilitated by computer-mediated communication (CMC), and the likely future influence of AI-mediated communication (AIMC) tools, we conducted a semi-structured interview study with 21 participants. We aim to understand: 1) the current role of technology in the breakup process, 2) the needs and support individuals have during the process, and 3) how AI might address these needs. Our research shows that people have distinct needs at various stages of ending a relationship. Presently, technology is used for information gathering and community support, acting as a catalyst for breakups, enabling ghosting and blocking, and facilitating communication. Participants anticipate that AI could aid in sense-making of their relationship leading up to the breakup, act as a mediator, assist in crafting appropriate wording, tones, and language during breakup conversations, and support companionship, reflection, recovery, and growth after a breakup. Our findings also demonstrate an overlap between the breakup process and the Transtheoretical Model (TTM) of behavior change. Through the lens of TTM, we explore the potential support and affordances AI could offer in breakups, including its benefits and the necessary precautions regarding AI's role in this sensitive process.

Submitted 17 January, 2024; originally announced January 2024.

arXiv:2401.08495 [pdf, other]

doi 10.1145/3630106.3658975

Large Language Models Portray Socially Subordinate Groups as More Homogeneous, Consistent with a Bias Observed in Humans

Authors: Messi H. J. Lee, Jacob M. Montgomery, Calvin K. Lai

Abstract: Large language models (LLMs) are becoming pervasive in everyday life, yet their propensity to reproduce biases inherited from training data remains a pressing concern. Prior investigations into bias in LLMs have focused on the association of social groups with stereotypical attributes. However, this is only one form of human bias such systems may reproduce. We investigate a new form of bias in LLMs that resembles a social psychological phenomenon where socially subordinate groups are perceived as more homogeneous than socially dominant groups. We had ChatGPT, a state-of-the-art LLM, generate texts about intersectional group identities and compared those texts on measures of homogeneity. We consistently found that ChatGPT portrayed African, Asian, and Hispanic Americans as more homogeneous than White Americans, indicating that the model described racial minority groups with a narrower range of human experience. ChatGPT also portrayed women as more homogeneous than men, but these differences were small. Finally, we found that the effect of gender differed across racial/ethnic groups such that the effect of gender was consistent within African and Hispanic Americans but not within Asian and White Americans. We argue that the tendency of LLMs to describe groups as less diverse risks perpetuating stereotypes and discriminatory behavior.

Submitted 25 April, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: Forthcoming at ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2024

arXiv:2401.01329 [pdf, other]

Self-Supervised Millimeter Wave Indoor Localization using Tiny Neural Networks

Authors: Anish Shastri, Steve Blandino, Camillo Gentile, Chiehping Lai, Paolo Casari

Abstract: The quasi-optical propagation of millimeter-wave signals enables high-accuracy localization algorithms that employ geometric approaches or machine learning models. However, most algorithms require information on the indoor environment, may entail the collection of large training datasets, or bear an infeasible computational burden for commercial off-the-shelf (COTS) devices. In this work, we propose to use tiny neural networks (NNs) to learn the relationship between angle difference-of-arrival (ADoA) measurements and locations of a receiver in an indoor environment. To relieve training data collection efforts, we resort to a self-supervised approach by bootstrapping the training of our neural network through location estimates obtained from a state-of-the-art localization algorithm. We evaluate our scheme via mmWave measurements from indoor 60-GHz double-directional channel sounding. We process the measurements to yield dominant multipath components, use the corresponding angles to compute ADoA values, and finally obtain location fixes. Results show that the tiny NN achieves sub-meter errors in 74% of the cases, thus performing as well as or even better than the state-of-the-art algorithm, with significantly lower computational complexity.

Submitted 2 January, 2024; originally announced January 2024.

Comments: 13 pages, 11 figures

arXiv:2401.00365 [pdf, other]

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

Authors: Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji

Abstract: Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, and hence degrades reconstruction accuracy. To mitigate this problem, we propose a novel unified framework to stochastically learn hierarchical discrete representation on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.

Submitted 28 March, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

Comments: 34 pages with 17 figures, accepted for TMLR

arXiv:2312.13594 [pdf, other]

Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA

Authors: Chengen Lai, Shengli Song, Shiqi Meng, Jingyang Li, Sitong Yan, Guangneng Hu

Abstract: Natural language explanation in visual question answer (VQA-NLE) aims to explain the decision-making process of models by generating natural language sentences to increase users' trust in the black-box systems. Existing post-hoc methods have achieved significant progress in obtaining a plausible explanation. However, such post-hoc explanations are not always aligned with human logical inference, suffering from the issues on: 1) Deductive unsatisfiability, the generated explanations do not logically lead to the answer; 2) Factual inconsistency, the model falsifies its counterfactual explanation for answers without considering the facts in images; and 3) Semantic perturbation insensitivity, the model cannot recognize the semantic changes caused by small perturbations. These problems reduce the faithfulness of explanations generated by models. To address the above issues, we propose a novel self-supervised Multi-level Contrastive Learning based natural language Explanation model (MCLE) for VQA with semantic-level, image-level, and instance-level factual and counterfactual samples. MCLE extracts discriminative features and aligns the feature spaces from explanations with visual question and answer to generate more consistent explanations. We conduct extensive experiments, ablation analysis, and case study to demonstrate the effectiveness of our method on two VQA-NLE benchmarks.

Submitted 21 December, 2023; originally announced December 2023.

Comments: AAAI 2024

arXiv:2312.01319 [pdf, ps, other]

Erdős similarity problem via bi-Lipschitz embedding

Authors: De-jun Feng, Chun-Kit Lai, Ying Xiong

Abstract: The Erdős similarity conjecture asserted that an infinite set of real numbers cannot be affinely embedded into every measurable set of positive Lebesgue measure. The problem is still open, in particular for all fast decaying sequences. In this paper, we relax the problem to the bi-Lipschitz embedding and obtain some sharp criteria about the bi-Lipschitz Erdős similarity problem for strictly decreasing sequences.

Submitted 3 December, 2023; originally announced December 2023.

MSC Class: 28A78; 28A05; 30L05; 11K55

arXiv:2311.16424 [pdf, other]

Manifold Preserving Guided Diffusion

Authors: Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon

Abstract: Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad range of tasks. Specifically, we leverage the manifold hypothesis to refine the guided diffusion steps and introduce a shortcut algorithm in the process. We then propose two methods for on-manifold training-free guidance using pre-trained autoencoders and demonstrate that our shortcut inherently preserves the manifolds when applied to latent diffusion models. Our experiments show that MPGD is efficient and effective for solving a variety of conditional generation applications in low-compute settings, and can consistently offer up to 3.8x speed-ups with the same number of diffusion steps while maintaining high sample quality compared to the baselines.

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.07111 [pdf, ps, other]

Semidefinite programming bounds on the size of entanglement-assisted codeword stabilized quantum codes

Authors: Ching-Yi Lai, Pin-Chieh Tseng, Wei-Hsuan Yu

Abstract: In this paper, we explore the application of semidefinite programming to the realm of quantum codes, specifically focusing on codeword stabilized (CWS) codes with entanglement assistance. Notably, we utilize the isotropic subgroup of the CWS group and the set of word operators of a CWS-type quantum code to derive an upper bound on the minimum distance. Furthermore, this characterization can be incorporated into the associated distance enumerators, enabling us to construct semidefinite constraints that lead to SDP bounds on the minimum distance or size of CWS-type quantum codes. We illustrate several instances where SDP bounds outperform LP bounds, and there are even cases where LP fails to yield meaningful results, while SDP consistently provides tight and relevant bounds. Finally, we also provide interpretations of the Shor-Laflamme weight enumerators and shadow enumerators for codeword stabilized codes, enhancing our understanding of quantum codes.

Submitted 13 November, 2023; originally announced November 2023.

Comments: 20 pages, 1 table

arXiv:2311.04149 [pdf, other]

HyperS2V: A Framework for Structural Representation of Nodes in Hyper Networks

Authors: Shu Liu, Cameron Lai, Fujio Toriumi

Abstract: In contrast to regular (simple) networks, hyper networks possess the ability to depict more complex relationships among nodes and store extensive information. Such networks are commonly found in real-world applications, such as in social interactions. Learning embedded representations for nodes involves a process that translates network structures into more simplified spaces, thereby enabling the application of machine learning approaches designed for vector data to be extended to network data. Nevertheless, there remains a need to delve into methods for learning embedded representations that prioritize structural aspects. This research introduces HyperS2V, a node embedding approach that centers on the structural similarity within hyper networks. Initially, we establish the concept of hyper-degrees to capture the structural properties of nodes within hyper networks. Subsequently, a novel function is formulated to measure the structural similarity between different hyper-degree values. Lastly, we generate structural embeddings utilizing a multi-scale random walk framework. Moreover, a series of experiments, both intrinsic and extrinsic, are performed on both toy and real networks. The results underscore the superior performance of HyperS2V in terms of both interpretability and applicability to downstream tasks.

Submitted 7 November, 2023; originally announced November 2023.

arXiv:2310.15416 [pdf, other]

Nominality Score Conditioned Time Series Anomaly Detection by Point/Sequential Reconstruction

Authors: Chih-Yu Lai, Fan-Keng Sun, Zhengqi Gao, Jeffrey H. Lang, Duane S. Boning

Abstract: Time series anomaly detection is challenging due to the complexity and variety of patterns that can occur. One major difficulty arises from modeling time-dependent relationships to find contextual anomalies while maintaining detection accuracy for point anomalies. In this paper, we propose a framework for unsupervised time series anomaly detection that utilizes point-based and sequence-based reconstruction models. The point-based model attempts to quantify point anomalies, and the sequence-based model attempts to quantify both point and contextual anomalies. Under the formulation that the observed time point is a two-stage deviated value from a nominal time point, we introduce a nominality score calculated from the ratio of a combined value of the reconstruction errors. We derive an induced anomaly score by further integrating the nominality score and anomaly score, then theoretically prove the superiority of the induced anomaly score over the original anomaly score under certain conditions. Extensive studies conducted on several public datasets show that the proposed framework outperforms most state-of-the-art baselines for time series anomaly detection.

Submitted 23 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023 (https://neurips.cc/virtual/2023/poster/70582)

arXiv:2310.13267 [pdf, other]

On the Language Encoder of Contrastive Cross-modal Models

Authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

Abstract: Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding training affect language encoder quality and cross-modal task performance. In VL pretraining, we found that sentence embedding training improves language encoder quality and aids in cross-modal tasks, improving contrastive VL models such as CyCLIP. In contrast, AL pretraining benefits less from sentence embedding training, which may result from the limited amount of pretraining data. We analyze the representation spaces to understand the strengths of sentence embedding training, and find that it improves text-space uniformity, at the cost of decreased cross-modal alignment.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.12682 [pdf, other]

Correcting phenomenological quantum noise via belief propagation

Authors: Kao-Yueh Kuo, Ching-Yi Lai

Abstract: Quantum stabilizer codes often face the challenge of syndrome errors due to error-prone measurements. To address this issue, multiple rounds of syndrome extraction are typically employed to obtain reliable error syndromes. In this paper, we consider phenomenological decoding problems, where data qubit errors may occur between two syndrome extractions, and each syndrome measurement can be faulty. To handle these diverse error sources, we define a generalized check matrix over mixed quaternary and binary alphabets to characterize their error syndromes. This generalized check matrix leads to the creation of a Tanner graph comprising quaternary and binary variable nodes, which facilitates the development of belief propagation (BP) decoding algorithms to tackle phenomenological errors. Importantly, our BP decoders are applicable to general sparse quantum codes. Through simulations of quantum memory protected by rotated toric codes, we demonstrate an error threshold of 3.3% in the phenomenological noise model. Additionally, we propose a method to construct effective redundant stabilizer checks for single-shot error correction. Simulations show that BP decoding performs exceptionally well, even when the syndrome error rate greatly exceeds the data error rate.

Submitted 19 October, 2023; originally announced October 2023.

Comments: 14 pages, 9 figures, 1 table

arXiv:2310.12116 [pdf, ps, other]

doi 10.1109/TETC.2023.3326295

Distributed Indexing Schemes for k-Dominant Skyline Analytics on Uncertain Edge-IoT Data

Authors: Chuan-Chi Lai, Hsuan-Yu Lin, Chuan-Ming Liu

Abstract: Skyline queries typically search a Pareto-optimal set from a given data set to solve the corresponding multiobjective optimization problem. As the number of criteria increases, the skyline presumes excessive data items, which yield a meaningless result. To address this curse of dimensionality, we proposed a k-dominant skyline in which the number of skyline members was reduced by relaxing the restriction on the number of dimensions, considering the uncertainty of data. Specifically, each data item was associated with a probability of appearance, which represented the probability of becoming a member of the k-dominant skyline. As data items appear continuously in data streams, the corresponding k-dominant skyline may vary with time. Therefore, an effective and rapid mechanism of updating the k-dominant skyline becomes crucial. Herein, we proposed two time-efficient schemes, Middle Indexing (MI) and All Indexing (AI), for k-dominant skyline in distributed edge-computing environments, where irrelevant data items can be effectively excluded from the compute to reduce the processing duration. Furthermore, the proposed schemes were validated with extensive experimental simulations. The experimental results demonstrated that the proposed MI and AI schemes reduced the computation time by approximately 13% and 56%, respectively, compared with the existing method.

Submitted 18 October, 2023; originally announced October 2023.

Comments: 13 pages, 8 figures, 12 tables, to appear in IEEE Transactions on Emerging Topics in Computing

arXiv:2310.11839 [pdf]

Neel tensor torque at the ferromagnet/antiferromagnet interface

Authors: Chao-Yao Yang, Sheng-Huai Chen, Chih-Hsiang Tseng, Chang-Yang Kuo, Hsiu-Hau Lin, Chih-Huang Lai

Abstract: Antiferromagnets (AFMs) exhibit spin arrangements with no net magnetization, positioning them as promising candidates for spintronics applications. While electrical manipulation of single-crystal AFMs, composed of periodic spin configurations, has been achieved recently, it remains a daunting challenge to characterize and to manipulate polycrystalline AFMs. Utilizing statistical analysis in data science, we demonstrate that polycrystalline AFMs can be described using a real, symmetric, positive semi-definite, rank-two tensor, which we term the Neel tensor. This tensor introduces a unique spin torque, diverging from the conventional field-like and Slonczewski torques in spintronics devices. Remarkably, Neel tensors can be trained to retain a specific orientation, functioning as a form of working memory. This attribute enables zero-field spin-orbit-torque switching in trilayer devices featuring a heavy-metal/ferromagnet/AFM structure and is also consistent with the X-ray magnetic linear dichroism measurements. Our findings uncover hidden statistical patterns in polycrystalline AFMs and establish the presence of Neel tensor torque, highlighting its potential to drive future spintronics innovations.

Submitted 18 October, 2023; originally announced October 2023.

Comments: main text 18 pages, supplementary information 10 pages

arXiv:2310.07654 [pdf, other]

Audio-Visual Neural Syntax Acquisition

Authors: Cheng-I Jeff Lai, Freda Shi, Puyuan Peng, Yoon Kim, Kevin Gimpel, Shiyu Chang, Yung-Sung Chuang, Saurabhchand Bhati, David Cox, David Harwath, Yang Zhang, Karen Livescu, James Glass

Abstract: We study phrase structure induction from visually-grounded speech. The core idea is to first segment the speech waveform into sequences of word segments, and subsequently induce phrase structure using the inferred segment-level continuous representations. We present the Audio-Visual Neural Syntax Learner (AV-NSL) that learns phrase structure by listening to audio and looking at images, without ever being exposed to text. By training on paired images and spoken captions, AV-NSL exhibits the capability to infer meaningful phrase structures that are comparable to those derived by naturally-supervised text parsers, for both English and German. Our findings extend prior work in unsupervised language acquisition from speech and grounded grammar induction, and present one approach to bridge the gap between the two topics.

Submitted 11 October, 2023; originally announced October 2023.

Showing 1–50 of 508 results for author: Lai, C