Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, et al. (30 additional authors not shown)

Abstract: Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

arXiv:2406.07349 [pdf, other]

Erasing Radio Frequency Fingerprints via Active Adversarial Perturbation

Authors: Zhaoyi Lu, Wenchao Xu, Ming Tu, Xin Xie, Cunqing Hua, Nan Cheng

Abstract: Radio Frequency (RF) fingerprinting is to identify a wireless device from its uniqueness of the analog circuitry or hardware imperfections. However, unlike the MAC address which can be modified, such hardware feature is inevitable for the signal emitted to air, which can possibly reveal device whereabouts, e.g., a sniffer can use a pre-trained model to identify a nearby device when receiving its signal. Such fingerprint may expose critical private information, e.g., the associated upper-layer applications or the end-user. In this paper, we propose to erase such RF feature for wireless devices, which can prevent fingerprinting by actively perturbation from the signal perspective. Specifically, we consider a common RF fingerprinting scenario, where machine learning models are trained from pilot signal data for identification. A novel adversarial attack solution is designed to generate proper perturbations, whereby the perturbed pilot signal can hide the hardware feature and misclassify the model. We theoretically show that the perturbation would not affect the communication function within a tolerable perturbation threshold. We also implement the pilot signal fingerprinting and the proposed perturbation process in a practical LTE system. Extensive experiment results demonstrate that the RF fingerprints can be effectively erased to protect the user privacy.

Submitted 12 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

arXiv:2404.06674 [pdf, other]

VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

Authors: Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma

Abstract: We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.github.io.

Submitted 11 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2402.08251 [pdf, other]

doi 10.1109/SII58957.2024.10417611

Object Detection in Thermal Images Using Deep Learning for Unmanned Aerial Vehicles

Authors: Minh Dang Tu, Kieu Trang Le, Manh Duong Phung

Abstract: This work presents a neural network model capable of recognizing small and tiny objects in thermal images collected by unmanned aerial vehicles. Our model consists of three parts, the backbone, the neck, and the prediction head. The backbone is developed based on the structure of YOLOv5 combined with the use of a transformer encoder at the end. The neck includes a BI-FPN block combined with the use of a sliding window and a transformer to increase the information fed into the prediction head. The prediction head carries out the detection by evaluating feature maps with the Sigmoid function. The use of transformers with attention and sliding windows increases recognition accuracy while keeping the model at a reasonable number of parameters and computation requirements for embedded systems. Experiments conducted on public dataset VEDAI and our collected datasets show that our model has a higher accuracy than state-of-the-art methods such as ResNet, Faster RCNN, ComNet, ViT, YOLOv5, SMPNet, and DPNetV3. Experiments on the embedded computer Jetson AGX show that our model achieves a real-time computation speed with a stability rate of over 90%.

Submitted 13 February, 2024; originally announced February 2024.

Comments: Published in: 2024 IEEE/SICE International Symposium on System Integration (SII)

arXiv:2310.13028 [pdf, other]

Reliable Academic Conference Question Answering: A Study Based on Large Language Model

Authors: Zhiwei Huang, Long Jin, Junjie Wang, Mingchen Tu, Yin Hua, Zhiqiang Liu, Jiawei Meng, Huajun Chen, Wen Zhang

Abstract: The rapid growth of computer science has led to a proliferation of research presented at academic conferences, fostering global scholarly communication. Researchers consistently seek accurate, current information about these events at all stages. This data surge necessitates an intelligent question-answering system to efficiently address researchers' queries and ensure awareness of the latest advancements. The information of conferences is usually published on their official website, organized in a semi-structured way with a lot of text. To address this need, we have developed the ConferenceQA dataset for 7 diverse academic conferences with human annotations. Firstly, we employ a combination of manual and automated methods to organize academic conference data in a semi-structured JSON format. Subsequently, we annotate nearly 100 question-answer pairs for each conference. Each pair is classified into four different dimensions. To ensure the reliability of the data, we manually annotate the source of each answer. In light of recent advancements, Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks. They have demonstrated impressive capabilities in information-seeking question answering after instruction fine-tuning, and as such, we present our conference QA study based on LLM. Due to hallucination and outdated knowledge of LLMs, we adopt retrieval based methods to enhance LLMs' question-answering abilities.
We have proposed a structure-aware retrieval method, specifically designed to leverage inherent structural information during the retrieval process. Empirical validation on the ConferenceQA dataset has demonstrated the effectiveness of this method. The dataset and code are readily accessible on https://github.com/zjukg/ConferenceQA.

Submitted 19 October, 2023; originally announced October 2023.

Comments: 10 pages, 4 figures, 2 tables

arXiv:2309.00597 [pdf, other]

The QUATRO Application Suite: Quantum Computing for Models of Human Cognition

Authors: Raghavendra Pradyumna Pothukuchi, Leon Lufkin, Yu Jun Shen, Alejandro Simon, Rome Thorstenson, Bernardo Eilert Trevisan, Michael Tu, Mudi Yang, Ben Foxman, Viswanatha Srinivas Pothukuchi, Gunnar Epping, Thi Ha Kyaw, Bryant J Jongkees, Yongshan Ding, Jerome R Busemeyer, Jonathan D Cohen, Abhishek Bhattacharjee

Abstract: Research progress in quantum computing has, thus far, focused on a narrow set of application domains. Expanding the suite of quantum application domains is vital for the discovery of new software toolchains and architectural abstractions. In this work, we unlock a new class of applications ripe for quantum computing research -- computational cognitive modeling. Cognitive models are critical to understanding and replicating human intelligence. Our work connects computational cognitive models to quantum computer architectures for the first time. We release QUATRO, a collection of quantum computing applications from cognitive models. The development and execution of QUATRO shed light on gaps in the quantum computing stack that need to be closed to ease programming and drive performance. Among several contributions, we propose and study ideas pertaining to quantum cloud scheduling (using data from gate- and annealing-based quantum computers), parallelization, and more. In the long run, we expect our research to lay the groundwork for more versatile quantum computer systems in the future.

Submitted 8 December, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

arXiv:2308.10173 [pdf, other]

FoodGPT: A Large Language Model in Food Testing Domain with Incremental Pre-training and Knowledge Graph Prompt

Authors: Zhixiao Qi, Yijiong Yu, Meiqi Tu, Junyi Tan, Yongfeng Huang

Abstract: Currently, the construction of large language models in specific domains is done by fine-tuning on a base model. Some models also incorporate knowledge bases without the need for pre-training. This is because the base model already contains domain-specific knowledge during the pre-training process. We build a large language model for food testing. Unlike the above approach, a significant amount of data in this domain exists in Scanning format for domain standard documents. In addition, there is a large amount of untrained structured knowledge. Therefore, we introduce an incremental pre-training step to inject this knowledge into a large language model. In this paper, we propose a method for handling structured knowledge and scanned documents in incremental pre-training. To overcome the problem of machine hallucination, we construct a knowledge graph to serve as an external knowledge base for supporting retrieval in the large language model. It is worth mentioning that this paper is a technical report of our pre-release version, and we will report our specific experimental data in future versions.

Submitted 20 August, 2023; originally announced August 2023.

arXiv:2305.18566 [pdf]

doi 10.1142/S2251171723400068

The Scientific Investigation of Unidentified Aerial Phenomena (UAP) Using Multimodal Ground-Based Observatories

Authors: Wesley Andrés Watters, Abraham Loeb, Frank Laukien, Richard Cloete, Alex Delacroix, Sergei Dobroshinsky, Benjamin Horvath, Ezra Kelderman, Sarah Little, Eric Masson, Andrew Mead, Mitch Randall, Forrest Schultz, Matthew Szenher, Foteini Vervelidou, Abigail White, Angelique Ahlström, Carol Cleland, Spencer Dockal, Natasha Donahue, Mark Elowitz, Carson Ezell, Alex Gersznowicz, Nicholas Gold, Michael G. Hercz, et al. (13 additional authors not shown)

Abstract: (Abridged) Unidentified Aerial Phenomena (UAP) have resisted explanation and have received little formal scientific attention for 75 years. A primary objective of the Galileo Project is to build an integrated software and instrumentation system designed to conduct a multimodal census of aerial phenomena and to recognize anomalies. Here we present key motivations for the study of UAP and address historical objections to this research. We describe an approach for highlighting outlier events in the high-dimensional parameter space of our census measurements. We provide a detailed roadmap for deciding measurement requirements, as well as a science traceability matrix (STM) for connecting sought-after physical parameters to observables and instrument requirements. We also discuss potential strategies for deciding where to locate instruments for development, testing, and final deployment.
Our instrument package is multimodal and multispectral, consisting of (1) wide-field cameras in multiple bands for targeting and tracking of aerial objects and deriving their positions and kinematics using triangulation; (2) narrow-field instruments including cameras for characterizing morphology, spectra, polarimetry, and photometry; (3) passive multistatic arrays of antennas and receivers for radar-derived range and kinematics; (4) radio spectrum analyzers to measure radio and microwave emissions; (5) microphones for sampling acoustic emissions in the infrasonic through ultrasonic frequency bands; and (6) environmental sensors for characterizing ambient conditions (temperature, pressure, humidity, and wind velocity), as well as quasistatic electric and magnetic fields, and energetic particles. The use of multispectral instruments and multiple sensor modalities will help to ensure that artifacts are recognized and that true detections are corroborated and verifiable.

Submitted 31 May, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: This paper is published in the Journal of Astronomical Instrumentation, 12(1), 2340006 (2023) https://doi.org/10.1142/S2251171723400068

Journal ref: Journal of Astronomical Instrumentation, 12(1), 2340006 (2023)

arXiv:2305.18551 [pdf]

doi 10.1142/S2251171723400056

Multi-Band Acoustic Monitoring of Aerial Signatures

Authors: Andrew Mead, Sarah Little, Paul Sail, Michelle Tu, Wesley Andrés Watters, Abigail White, Richard Cloete

Abstract: The Galileo Project's acoustic monitoring, omni-directional system (AMOS) aids in the detection and characterization of aerial phenomena. It uses a multi-band microphone suite spanning infrasonic to ultrasonic frequencies, providing an independent signal modality for validation and characterization of detected objects. The system utilizes infrasonic, audible, and ultrasonic systems to cover a wide range of sounds produced by both natural and man-made aerial phenomena. Sound signals from aerial objects can be captured given certain conditions, such as when the sound level is above ambient noise and isn't excessively distorted by its transmission path. Findings suggest that audible sources can be detected up to 1 km away, infrasonic sources can be detected over much longer distances, and ultrasonic at shorter ones. Initial data collected from aircraft recordings with spectral analysis will help develop algorithms and software for quick identification of known aircraft. Future work will involve multi-sensor arrays for sound localization, larger data sets analysis, and incorporation of machine learning and AI for detection and identification of more types of phenomena in all frequency bands.

Submitted 29 May, 2023; originally announced May 2023.

Journal ref: Journal of Astronomical Instrumentation, 12(1), 2340005 (2023)

arXiv:2305.15719 [pdf, other]

Efficient Neural Music Generation

Authors: Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang

Abstract: Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for a real-time generation. Efficient music generation with a quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audios of state-of-the-art quality meanwhile reducing 95.7% or 99.6% forward passes in MusicLM, respectively, for sampling 10s or 30s music. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.11576 [pdf, other]

Language-universal phonetic encoder for low-resource speech recognition

Authors: Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Abstract: Multilingual training is effective in improving low-resource ASR, which may partially be explained by phonetic representation sharing between languages. In end-to-end (E2E) ASR systems, graphemes are often used as basic modeling units, however graphemes may not be ideal for multilingual phonetic sharing. In this paper, we leverage International Phonetic Alphabet (IPA) based language-universal phonetic model to improve low-resource ASR performances, for the first time within the attention encoder-decoder architecture. We propose an adaptation method on the phonetic IPA model to further improve the proposed approach on extreme low-resource languages. Experiments carried out on the open-source MLS corpus and our internal databases show our approach outperforms baseline monolingual models and most state-of-the-art works. Our main approach and adaptation are effective on extremely low-resource languages, even within domain- and language-mismatched scenarios.

Submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted for publication in INTERSPEECH 2023

arXiv:2305.11569 [pdf, ps, other]

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

Authors: Siyuan Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Abstract: We improve low-resource ASR by integrating the ideas of multilingual training and self-supervised learning. Concretely, we leverage an International Phonetic Alphabet (IPA) multilingual model to create frame-level pseudo labels for unlabeled speech, and use these pseudo labels to guide hidden-unit BERT (HuBERT) based speech pretraining in a phonetically-informed manner. The experiments on the Multilingual Speech (MLS) Corpus show that the proposed approach consistently outperforms the standard HuBERT on all the target languages. Moreover, on 3 of the 4 languages, comparing to the standard HuBERT, the approach performs better, meanwhile is able to save supervised training data by 1.5k hours (75%) at most. Our approach outperforms most of the state of the arts, with much less pretraining data in terms of hours and language diversity. Compared to XLSR-53 and a retraining based multilingual method, our approach performs better with full and limited finetuning data scenarios.

Submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted for publication in INTERSPEECH 2023

arXiv:2305.05226 [pdf, other]

Multi-Teacher Knowledge Distillation For Text Image Machine Translation

Authors: Cong Ma, Yaping Zhang, Mei Tu, Yang Zhao, Yu Zhou, Chengqing Zong

Abstract: Text image machine translation (TIMT) has been widely used in various real-world applications, which translates source language texts in images into another target language sentence. Existing methods on TIMT are mainly divided into two categories: the recognition-then-translation pipeline model and the end-to-end model. However, how to transfer knowledge from the pipeline model into the end-to-end model remains an unsolved problem. In this paper, we propose a novel Multi-Teacher Knowledge Distillation (MTKD) method to effectively distillate knowledge into the end-to-end TIMT model from the pipeline model. Specifically, three teachers are utilized to improve the performance of the end-to-end TIMT model. The image encoder in the end-to-end TIMT model is optimized with the knowledge distillation guidance from the recognition teacher encoder, while the sequential encoder and decoder are improved by transferring knowledge from the translation sequential and decoder teacher models. Furthermore, both token and sentence-level knowledge distillations are incorporated to better boost the translation performance. Extensive experimental results show that our proposed MTKD effectively improves the text image translation performance and outperforms existing end-to-end and pipeline models with fewer parameters and less decoding time, illustrating that MTKD can take advantage of both pipeline and end-to-end models.

Submitted 9 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted at The 17th International Conference on Document Analysis and Recognition (ICDAR 2023)

arXiv:2305.05166 [pdf, other]

E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation

Authors: Cong Ma, Yaping Zhang, Mei Tu, Yang Zhao, Yu Zhou, Chengqing Zong

Abstract: Text image machine translation (TIMT) aims to translate texts embedded in images from one source language to another target language. Existing methods, both two-stage cascade and one-stage end-to-end architectures, suffer from different issues. The cascade models can benefit from the large-scale optical character recognition (OCR) and MT datasets but the two-stage architecture is redundant. The end-to-end models are efficient but suffer from training data deficiency. To this end, in our paper, we propose an end-to-end TIMT model fully making use of the knowledge from existing OCR and MT datasets to pursue both an effective and efficient framework. More specifically, we build a novel modal adapter effectively bridging the OCR encoder and MT decoder. End-to-end TIMT loss and cross-modal contrastive loss are utilized jointly to align the feature distribution of the OCR and MT tasks. Extensive experiments show that the proposed method outperforms the existing two-stage cascade models and one-stage end-to-end models with a lighter and faster architecture. Furthermore, the ablation studies verify the generalization of our method, where the proposed modal adapter is effective to bridge various OCR and MT models.

Submitted 9 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted at The 17th International Conference on Document Analysis and Recognition (ICDAR 2023)

arXiv:2305.03949 [pdf, other]

Label-Free Multi-Domain Machine Translation with Stage-wise Training

Authors: Fan Zhang, Mei Tu, Sangha Kim, Song Liu, Jinyao Yan

Abstract: Most multi-domain machine translation models rely on domain-annotated data. Unfortunately, domain labels are usually unavailable in both training processes and real translation scenarios. In this work, we propose a label-free multi-domain machine translation model which requires only a few or no domain-annotated data in training and no domain labels in inference. Our model is composed of three parts: a backbone model, a domain discriminator taking responsibility to discriminate data from different domains, and a set of experts that transfer the decoded features from generic to specific. We design a stage-wise training strategy and train the three parts sequentially. To leverage the extra domain knowledge and improve the training stability, in the discriminator training stage, domain differences are modeled explicitly with clustering and distilled into the discriminator through a multi-classification task. Meanwhile, the Gumbel-Max sampling is adopted as the routing scheme in the expert training stage to achieve the balance of each expert in specialization and generalization. Experimental results on the German-to-English translation task show that our model significantly improves BLEU scores on six different domains and even outperforms most of the models trained with domain-annotated data.

Submitted 6 May, 2023; originally announced May 2023.

arXiv:2303.09279 [pdf, other]

Privacy-Preserving Video Conferencing via Thermal-Generative Images

Authors: Sheng-Yang Chiu, Yu-Ting Huang, Chieh-Ting Lin, Yu-Chee Tseng, Jen-Jee Chen, Meng-Hsuan Tu, Bo-Chen Tung, YuJou Nieh

Abstract: Due to the COVID-19 epidemic, video conferencing has evolved as a new paradigm of communication and teamwork. However, private and personal information can be easily leaked through cameras during video conferencing. This includes leakage of a person's appearance as well as the contents in the background. This paper proposes a novel way of using online low-resolution thermal images as conditions to guide the synthesis of RGB images, bringing a promising solution for real-time video conferencing when privacy leakage is a concern. SPADE-SR (Spatially-Adaptive De-normalization with Self Resampling), a variant of SPADE, is adopted to incorporate the spatial property of a thermal heatmap and the non-thermal property of a normal, privacy-free pre-recorded RGB image provided in a form of latent code. We create a PAIR-LRT-Human (LRT = Low-Resolution Thermal) dataset to validate our claims. The result enables a convenient way of video conferencing where users no longer need to groom themselves and tidy up backgrounds for a short meeting. Additionally, it allows a user to switch to a different appearance and background during a conference.

Submitted 28 March, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: Accepted for publication at IEEE International Conference on Robotics and Automation (ICRA) 2023

arXiv:2301.00066 [pdf, other]

Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition

Authors: Yukun Feng, Ming Tu, Rui Xia, Chuanzeng Huang, Yuxuan Wang

Abstract: Recent studies have shown that using an external Language Model (LM) benefits the end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. The long-tail prediction problems have been widely studied in many applications, but only been addressed by a few studies for ASR and LMs. In this paper, we propose a new memory augmented lookup dictionary based Transformer architecture for LM. The newly introduced lookup dictionary incorporates rich contextual information in training set, which is vital to correctly predict long-tail tokens. With intensive experiments on Chinese and English data sets, our proposed method is proved to outperform the baseline Transformer LM by a great margin on both word/character error rate and tail tokens error rate. This is achieved without impact on the decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting the ASR decoding performance, especially for long-tail tokens.

Submitted 30 December, 2022; originally announced January 2023.

Comments: Submitted to ICASSP 2023

arXiv:2212.13899 [pdf, other]

doi 10.1007/s10506-022-09341-8

Attentive Deep Neural Networks for Legal Document Retrieval

Authors: Ha-Thanh Nguyen, Manh-Kien Phi, Xuan-Bach Ngo, Vu Tran, Le-Minh Nguyen, Minh-Phuong Tu

Abstract: Legal text retrieval serves as a key component in a wide range of legal text processing tasks such as legal question answering, legal case entailment, and statute law retrieval. The performance of legal text retrieval depends, to a large extent, on the representation of text, both query and legal documents. Based on good representations, a legal text retrieval model can effectively match the query to its relevant documents. Because legal documents often contain long articles and only some parts are relevant to queries, it is quite a challenge for existing models to represent such documents. In this paper, we study the use of attentive neural network-based text representation for statute law document retrieval. We propose a general approach using deep neural networks with attention mechanisms. Based on it, we develop two hierarchical architectures with sparse attention to represent long sentences and articles, and we name them Attentive CNN and Paraformer. The methods are evaluated on datasets of different sizes and characteristics in English, Japanese, and Vietnamese. Experimental results show that: i) Attentive neural methods substantially outperform non-neural methods in terms of retrieval performance across datasets and languages; ii) Pretrained transformer-based models achieve better accuracy on small datasets at the cost of high computational complexity while lighter weight Attentive CNN achieves better accuracy on large datasets; and iii) Our proposed Paraformer outperforms state-of-the-art methods on COLIEE dataset, achieving the highest recall and F2 scores in the top-N retrieval task.

Submitted 12 December, 2022; originally announced December 2022.

Comments: Preprint version. The official version will be published in Artificial Intelligence and Law journal

arXiv:2210.15158 [pdf, other]

Streaming Voice Conversion Via Intermediate Bottleneck Features And Non-streaming Teacher Guidance

Authors: Yuanzhe Chen, Ming Tu, Tang Li, Xin Li, Qiuqiang Kong, Jiaxin Li, Zhichao Wang, Qiao Tian, Yuping Wang, Yuxuan Wang

Abstract: Streaming voice conversion (VC) is the task of converting the voice of one person to another in real-time. Previous streaming VC methods use phonetic posteriorgrams (PPGs) extracted from automatic speech recognition (ASR) systems to represent speaker-independent information. However, PPGs lack the prosody and vocalization information of the source speaker, and streaming PPGs contain undesired leaked timbre of the source speaker. In this paper, we propose to use intermediate bottleneck features (IBFs) to replace PPGs. VC systems trained with IBFs retain more prosody and vocalization information of the source speaker. Furthermore, we propose a non-streaming teacher guidance (TG) framework that addresses the timbre leakage problem. Experiments show that our proposed IBFs and the TG framework achieve a state-of-the-art streaming VC naturalness of 3.85, a content consistency of 3.77, and a timbre similarity of 3.77 under a future receptive field of 160 ms which significantly outperform previous streaming VC systems.

Submitted 26 October, 2022; originally announced October 2022.

Comments: The paper has been submitted to ICASSP 2023

arXiv:2210.03887 [pdf, other]

Improving End-to-End Text Image Translation From the Auxiliary Text Translation Task

Authors: Cong Ma, Yaping Zhang, Mei Tu, Xu Han, Linghui Wu, Yang Zhao, Yu Zhou

Abstract: End-to-end text image translation (TIT), which aims at translating the source language embedded in images to the target language, has attracted intensive attention in recent research. However, data sparsity limits the performance of end-to-end text image translation. Multi-task learning is a non-trivial way to alleviate this problem via exploring knowledge from complementary related tasks. In this paper, we propose a novel text translation enhanced text image translation, which trains the end-to-end model with text translation as an auxiliary task. By sharing model parameters and multi-task training, our model is able to take full advantage of easily-available large-scale text parallel corpus. Extensive experimental results show our proposed method outperforms existing end-to-end methods, and the joint multi-task learning with both text translation and recognition tasks achieves better results, proving translation and recognition auxiliary tasks are complementary.

Submitted 7 October, 2022; originally announced October 2022.

Comments: Accepted at the 26th International Conference on Pattern Recognition (ICPR 2022)

arXiv:2209.10475 [pdf, other]

Designing PIDs for Reproducible Science Using Time-Series Data

Authors: Wen Ting Maria Tu, Stephen Makonin

Abstract: As part of the investigation done by the IEEE Standards Association P2957 Working Group, called Big Data Governance and Metadata Management, the use of persistent identifiers (PIDs) is looked at for tackling the problem of reproducible research and science. This short paper proposes a preliminary method using PIDs to reproduce research results using time-series data. Furthermore, we feel it is possible to use the methodology and design for other types of datasets.

Submitted 21 September, 2022; originally announced September 2022.

Comments: Submitted to MTSR 2022 - 16th International Conference on Metadata and Semantics Research

arXiv:2208.04805 [pdf, other]

Exploiting anisotropic Rashba effects on real-time photocurrents and spin polarization for transient symmetry breaking

Authors: Matisse Wei-Yuan Tu, Jyh-Pin Chou, Chih-Wei Luo

Abstract: We theoretically investigate the real-time transient responses of a two-dimensional (2D) electron gas with anisotropic Rashba spin-orbit coupling (SOC) to laser pulses. Through explicitly monitoring the time-dependent photocurrents and spin polarization under different linear polarizations of the laser pulse, we find that the transient breaking of the mirror symmetry in combination with the anisotropy of the Rashba SOC results in significant distinction between the charge-mediated and the spin-mediated contributions to the photocurrents. Such distinction is obtained by analyzing the dependence of the symmetry-breaking induced (transverse) components of the photocurrents on the linear polarization angle of the laser pulse. This suggests a possibility of inferring spin-mediated processes in photocurrents without the use of circularly polarized lights. Moreover, the interplay between transient symmetry breaking and the anisotropy of the Rashba SOC also leads to transiently nonzero spin polarization components that are otherwise zero in the steady-state limit and the linear response regime. Especially, the out-of-plane spin polarization component can be induced or turned off by controlling the relative orientation of the linear polarization with respect to the symmetry axis of the 2D electronic system, without involving material-intrinsic magnetization effects. Our findings demonstrate the efficacy of a particular coordination between the polarization of the ultrafast laser pulses and the spatial symmetry of the electronic materials in directing the real-time charge and the spin responses that are fundamental to the development of ultrafast spintronics in solid states.

Submitted 9 August, 2022; originally announced August 2022.

arXiv:2207.08525 [pdf, other]

Angular Gap: Reducing the Uncertainty of Image Difficulty through Model Calibration

Authors: Bohua Peng, Mobarakol Islam, Mei Tu

Abstract: Curriculum learning needs example difficulty to proceed from easy to hard. However, the credibility of image difficulty is rarely investigated, which can seriously affect the effectiveness of curricula. In this work, we propose Angular Gap, a measure of difficulty based on the difference in angular distance between feature embeddings and class-weight embeddings built by hyperspherical learning. To ascertain difficulty estimation, we introduce class-wise model calibration, as a post-training technique, to the learnt hyperbolic space. This bridges the gap between probabilistic model calibration and angular distance estimation of hyperspherical learning. We show the superiority of our calibrated Angular Gap over recent difficulty metrics on CIFAR10-H and ImageNetV2. We further propose Angular Gap based curriculum learning for unsupervised domain adaptation that can translate from learning easy samples to mining hard samples. We combine this curriculum with a state-of-the-art self-training method, Cycle Self Training (CST). The proposed Curricular CST learns robust representations and outperforms recent baselines on Office31 and VisDA 2017.

Submitted 18 July, 2022; originally announced July 2022.

Comments: 13 pages

arXiv:2205.08036 [pdf, ps, other]

On Semiparametric Efficiency of an Emerging Class of Regression Models for Between-subject Attributes

Authors: Jinyuan Liu, Tuo Lin, Tian Chen, Xinlian Zhang, Xin M. Tu

Abstract: The semiparametric regression models have attracted increasing attention owing to their robustness compared to their parametric counterparts. This paper discusses the efficiency bound for functional response models (FRM), an emerging class of semiparametric regression that serves as a timely solution for research questions involving pairwise observations. This new paradigm is especially appealing to reduce astronomical data dimensions for those arising from wearable devices and high-throughput technology, such as microbiome Beta-diversity, viral genetic linkage, single-cell RNA sequencing, etc. Despite the growing applications, the efficiency of their estimators has not been investigated carefully due to the extreme difficulty to address the inherent correlations among pairs. Leveraging the Hilbert-space-based semiparametric efficiency theory for classical within-subject attributes, this manuscript extends such asymptotic efficiency into the broader regression involving between-subject attributes and pinpoints the most efficient estimator, which leads to a sensitive signal-detection in practice. With pairwise outcomes burgeoning immensely as effective dimension-reduction summaries, the established theory will not only fill the critical gap in identifying the most efficient semiparametric estimator but also propel wide-ranging implementations of this new paradigm for between-subject attributes.

Submitted 16 May, 2022; originally announced May 2022.

arXiv:2110.03347 [pdf, ps, other]

Cloning one's voice using very limited data in the wild

Authors: Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, Yuxuan Wang

Abstract: With the increasing popularity of speech synthesis products, the industry has put forward more requirements for personalized speech synthesis: (1) How to use low-resource, easily accessible data to clone a person's voice. (2) How to clone a person's voice while controlling the style and prosody. To solve the above two problems, we proposed the Hieratron model framework in which the prosody and timbre are modeled separately using two modules, therefore, the independent control of timbre and the other characteristics of audio can be achieved while generating speech. The practice shows that, for very limited target speaker data in the wild, Hieratron has obvious advantages over the traditional method, in addition to controlling the style and language of the generated speech, the mean opinion score on speech quality of the generated speech has also been improved by more than 0.2 points.

Submitted 8 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

arXiv:2104.14147 [pdf, other]

doi 10.1088/2053-1583/ac186e

Revealing the non-adiabatic and non-Abelian multiple-band effects via anisotropic valley Hall conduction in bilayer graphene

Authors: Ci Li, Matisse Wei-Yuan Tu, Wang Yao

Abstract: Many quantum materials of interest, e.g., bilayer graphene, possess a number of closely spaced but not fully degenerate bands near the Fermi level, where the coupling to the far detuned remote bands can induce Berry curvatures of the non-Abelian character in this active multiple-band manifold for transport effects. Under finite electric fields, non-adiabatic interband transition processes are expected to play significant roles in the associated Hall conduction. Here through an exemplified study on the valley Hall conduction in AB-stacked bilayer graphene, we show that the contribution arising from non-adiabatic transitions around the bands near the Fermi energy to the Hall current is not only quantitatively about an order-of-magnitude larger than the contribution due to adiabatic inter-manifold transition with the non-Abelian Berry curvatures. Due to the trigonal warping, the former also displays an anisotropic response to the orientation of the applied electric field that is qualitatively distinct from that of the latter. We further show that these anisotropic responses also reveal the essential differences between the diagonal and off-diagonal elements of the non-Abelian Berry curvature matrix in terms of their contributions to the Hall currents. We provide a physically intuitive understanding of the origin of distinct anisotropic features from different Hall current contributions, in terms of band occupations and interband coherence. This then points to the generalization beyond the specific example of bilayer graphenes.

Submitted 14 July, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

Comments: 9 pages, 5 figures

Journal ref: Ci Li, Matisse Wei-Yuan Tu, and Wang Yao, 2D Mater. 8 045012 (2021)

arXiv:2103.02575 [pdf]

Quantifying Photoinduced Polaronic Distortions in Inorganic Lead Halide Perovskites Nanocrystals

Authors: Oliviero Cannelli, Nicola Colonna, Michele Puppin, Thomas Rossi, Dominik Kinschel, Ludmila Leroy, Janina Loeffler, Anne Marie March, Gilles Doumy, Andre Al Haddad, Ming-Feng Tu, Yoshiaki Kumagai, Donald Walko, Grigory Smolentsev, Franziska Krieg, Simon C. Boehme, Maksym V. Kovalenko, Majed Chergui, Giulia F. Mancini

Abstract: The development of next generation perovskite-based optoelectronic devices relies critically on the understanding of the interaction between charge carriers and the polar lattice in out-of-equilibrium conditions. While it has become increasingly evident for CsPbBr3 perovskites that the Pb-Br framework flexibility plays a key role in their light-activated functionality, the corresponding local structural rearrangement has not yet been unambiguously identified. In this work, we demonstrate that the photoinduced lattice changes in the system are due to a specific polaronic distortion, associated with the activation of a longitudinal optical phonon mode at 18 meV by electron-phonon coupling, and we quantify the associated structural changes with atomic-level precision. Key to this achievement is the combination of time-resolved and temperature-dependent studies at Br K-edge and Pb L3-edge X-ray absorption with refined ab-initio simulations, which fully account for the screened core-hole final state effects on the X-ray absorption spectra. From the temporal kinetics, we show that carrier recombination reversibly unlocks the structural deformation at both Br and Pb sites. The comparison with the temperature-dependent XAS results rules out thermal effects as the primary source of distortion of the Pb-Br bonding motif during photoexcitation. Our work provides a comprehensive description of the CsPbBr3 perovskites photophysics, offering novel insights on the light-induced response of the system and its exceptional optoelectronic properties.

Submitted 3 March, 2021; originally announced March 2021.

Comments: Main: 27 pages, 4 figures SI: 16 pages, 8 figures

arXiv:2102.10818 [pdf]

doi 10.1126/sciadv.abg8094

Spin Photovoltaic Effect in Magnetic van der Waals Heterostructures

Authors: Tiancheng Song, Eric Anderson, Matisse Wei-Yuan Tu, Kyle Seyler, Takashi Taniguchi, Kenji Watanabe, Michael A. McGuire, Xiaosong Li, Ting Cao, Di Xiao, Wang Yao, Xiaodong Xu

Abstract: The development of van der Waals (vdW) crystals and their heterostructures has created a fascinating platform for exploring optoelectronic properties in the two-dimensional (2D) limit. With the recent discovery of 2D magnets, the control of the spin degree of freedom can be integrated to realize 2D spin-optoelectronics with spontaneous time-reversal symmetry breaking. Here, we report spin photovoltaic effects in vdW heterostructures of atomically thin magnet chromium triiodide (CrI3) sandwiched by graphene contacts. In the absence of a magnetic field, the photocurrent displays a distinct dependence on light helicity, which can be tuned by varying the magnetic states and photon energy. Circular polarization-resolved absorption measurements reveal that these observations originate from magnetic-order-coupled and thus helicity-dependent charge-transfer exciton states. The photocurrent displays multiple plateaus as the magnetic field is swept, which are associated with different spin configurations enabled by the layered antiferromagnetism and spin-flip transitions in CrI3. Remarkably, giant photo-magnetocurrent is observed, which tends to infinity for a small applied bias. Our results pave the way to explore emergent photo-spintronics by engineering magnetic vdW heterostructures.

Submitted 22 February, 2021; originally announced February 2021.

arXiv:2009.06849 [pdf]

doi 10.1088/0256-307X/37/10/107201

Giant spin transfer torque in atomically thin magnetic bilayers

Authors: Weihao Cao, Matisse Wei-Yuan Tu, Jiang Xiao, Wang Yao

Abstract: In cavity quantum electrodynamics, the multiple reflections of a photon between two mirrors defining a cavity is exploited to enhance the light-coupling of an intra-cavity atom. We show that this paradigm for enhancing the interaction of a flying particle with a localized object can be generalized to spintronics based on van der Waals 2D magnets. Upon tunneling through a magnetic bilayer, we find the spin transfer torques per electron incidence can become orders of magnitude larger than $\hbar/2$, made possible by electron's multi-reflection path through the ferromagnetic monolayers as an intermediate of their angular momentum transfer. Over a broad energy range around the tunneling resonances, the damping-like spin transfer torque per electron tunneling features a universal value of $\frac{\hbar}{2} \tan{\frac{\theta}{2}}$, depending only on the angle $\theta$ between the magnetizations. These findings expand the scope of magnetization manipulations for high-performance and high-density storage based on van der Waals magnets.

Submitted 27 September, 2020; v1 submitted 14 September, 2020; originally announced September 2020.

Comments: Published as an Express Letter on Chinese Physics Letters

Journal ref: Chin. Phys. Lett. 37, 107201 (2020)

arXiv:2005.14286 [pdf, other]

Generative network complex for the automated generation of druglike molecules

Authors: Kaifu Gao, Duc D Nguyen, Meihua Tu, Guo-Wei Wei

Abstract: Current drug discovery is expensive and time-consuming. It remains a challenging task to create a wide variety of novel compounds with desirable pharmacological properties and cheaply available to low-income people. In this work, we develop a generative network complex (GNC) to generate new drug-like molecules based on the multi-property optimization via the gradient descent in the latent space of an autoencoder. In our GNC, both multiple chemical properties and similarity scores are optimized to generate and predict drug-like molecules with desired chemical properties. To further validate the reliability of the predictions, these molecules are reevaluated and screened by independent 2D fingerprint-based predictors to come up with a few hundreds of new drug candidates. As a demonstration, we apply our GNC to generate a large number of new BACE1 inhibitors, as well as thousands of novel alternative drug candidates for eight existing market drugs, including Ceritinib, Ribociclib, Acalabrutinib, Idelalisib, Dabrafenib, Macimorelin, Enzalutamide, and Panobinostat.

Submitted 28 May, 2020; originally announced May 2020.

Comments: 27 pages, 2 tables and 19 figures

arXiv:2004.06279 [pdf, other]

doi 10.1103/PhysRevB.102.045423

Theory of wavepacket transport under narrow gaps and spatial textures: non-adiabaticity and semiclassicality

Authors: Matisse Wei-Yuan Tu, Ci Li, Wang Yao

Abstract: We generalise the celebrated semiclassical wavepacket approach from the adiabatic to the non-adiabatic regime. A unified description covering both of these regimes is particularly desired for systems with spatially varying band structures where band gaps of various sizes are simultaneously present, e.g. in moiré patterns. For a single wavepacket, alternative to the previous derivation by Lagrangian variational approach, we show that the same semiclassical equations of motion can be obtained by introducing a spatial-texture-induced force operator similar to the Ehrenfest theorem. For semiclassically computing the current, the ensemble of wavepackets based on adiabatic dynamics is shown to well correspond to a phase-space fluid for which the fluid's mass and velocity are two distinguishable properties. This distinction is not inherited to the ensemble of wavepackets with the non-adiabatic dynamics. We extend the adiabatic kinetic theory to the non-adiabatic regime by taking into account decoherence, whose joint action with electric field favours certain form of inter-band coherence. The steady-state density matrix as a function of the phase-space variables is then phenomenologically obtained for calculating the transport current. The result, applicable with a finite electric field, expectedly reproduces the known adiabatic limit by taking the electric field to be infinitesimal, and therefore attains a unified description from the adiabatic to the non-adiabatic situations.

Submitted 13 April, 2020; originally announced April 2020.

Comments: 16 pages, 1 figure

Journal ref: Phys. Rev. B 102, 045423 (2020)

arXiv:2004.02001 [pdf, other]

Graph Sequential Network for Reasoning over Sequences

Authors: Ming Tu, Jing Huang, Xiaodong He, Bowen Zhou

Abstract: Recently Graph Neural Network (GNN) has been applied successfully to various NLP tasks that require reasoning, such as multi-hop machine reading comprehension. In this paper, we consider a novel case where reasoning is needed over graphs built from sequences, i.e. graph nodes with sequence data. Existing GNN models fulfill this goal by first summarizing the node sequences into fixed-dimensional vectors, then applying GNN on these vectors. To avoid information loss inherent in the early summarization and make sequential labeling tasks on GNN output feasible, we propose a new type of GNN called Graph Sequential Network (GSN), which features a new message passing algorithm based on co-attention between a node and each of its neighbors. We validate the proposed GSN on two NLP tasks: interpretable multi-hop reading comprehension on HotpotQA and graph based fact verification on FEVER. Both tasks require reasoning over multiple documents or sentences. Our experimental results show that the proposed GSN attains better performance than the standard GNN based methods.

Submitted 4 April, 2020; originally announced April 2020.

Comments: Part of this paper was presented at NeurIPS 2019 Workshop on Graph Representation Learning

arXiv:2004.01326 [pdf, other]

doi 10.1088/2053-1583/ab89e8

Non-adiabatic Hall effect at Berry curvature hot spot

Authors: Matisse Wei-Yuan Tu, Ci Li, Hongyi Yu, Wang Yao

Abstract: Hot spot of Berry curvature is usually found at Bloch band anti-crossings, where the Hall effect due to the Berry phase can be most pronounced. With small gaps there, the adiabatic limit for the existing formulations of Hall current can be exceeded in a moderate electric field. Here we present a theory of non-adiabatic Hall effect, capturing non-perturbatively the across gap electron-hole excitations by the electric field. We find a general connection between the field induced electron-hole coherence and intrinsic Hall velocity. In coherent evolution, the electron-hole coherence can manifest as a sizeable ac Hall velocity. When environmental noise is taken into account, its joint action with the electric field favors a form of electron-hole coherence that is function of wavevector and field only, leading to a dc nonlinear Hall effect. The Hall current has all odd order terms in field, and still retains the intrinsic role of the Berry curvature. The quantitative demonstration uses the example of gapped Dirac cones, and our theory can be used to describe the bulk pseudospin Hall current in insulators with gapped edge such as graphene and 2D MnBi$_{2}$Te$_{4}$.

Submitted 2 April, 2020; originally announced April 2020.

Comments: submitted, 5 pages, 2 figures

Journal ref: 2D Mater. 2020

arXiv:2003.03909 [pdf, other]

doi 10.1103/PhysRevLett.124.236001

RIXS Reveals Hidden Local Transitions of the Aqueous OH Radical

Authors: L. Kjellsson, K. Nanda, J. -E. Rubensson, G. Doumy, S. H. Southworth, P. J. Ho, A. M. March, A. Al Haddad, Y. Kumagai, M. -F. Tu, R. Schaller, T. Debnath, M. S. Bin Mohd Yusof, C. Arnold, W. F. Schlotter, S. Moeller, G. Coslovich, J. D. Koralek, M. P. Minitti, M. L. Vidal, M. Simon, R. Santra, Z. -H. Loh, S. Coriani, A. I. Krylov , et al. (1 additional authors not shown)

Abstract: Resonant inelastic x-ray scattering (RIXS) provides remarkable opportunities to interrogate ultrafast dynamics in liquids. Here we use RIXS to study the fundamentally and practically important hydroxyl radical in liquid water, OH(aq). Impulsive ionization of pure liquid water produced a short-lived population of OH(aq), which was probed using femtosecond x-rays from an x-ray free-electron laser. We find that RIXS reveals localized electronic transitions that are masked in the ultraviolet absorption spectrum by strong charge-transfer transitions -- thus providing a means to investigate the evolving electronic structure and reactivity of the hydroxyl radical in aqueous and heterogeneous environments. First-principles calculations provide interpretation of the main spectral features.

Submitted 8 March, 2020; originally announced March 2020.

Comments: 40 pages, 10 figures

Journal ref: Phys. Rev. Lett. 124, 236001 (2020)

arXiv:1911.01533 [pdf, other]

Speaker-invariant Affective Representation Learning via Adversarial Training

Authors: Haoqi Li, Ming Tu, Jing Huang, Shrikanth Narayanan, Panayiotis Georgiou

Abstract: Representation learning for speech emotion recognition is challenging due to labeled data sparsity issue and lack of gold standard references. In addition, there is much variability from input speech signals, human subjective perception of the signals and emotion label ambiguity. In this paper, we propose a machine learning framework to obtain speech emotion representations by limiting the effect of speaker variability in the speech signals. Specifically, we propose to disentangle the speaker characteristics from emotion through an adversarial training network in order to better represent emotion. Our method combines the gradient reversal technique with an entropy loss function to remove such speaker information. Our approach is evaluated on both IEMOCAP and CMU-MOSEI datasets. We show that our method improves speech emotion classification and increases generalization to unseen speakers.

Submitted 12 August, 2021; v1 submitted 4 November, 2019; originally announced November 2019.

Comments: Accepted by ICASSP 2020; 5 pages

arXiv:1911.00930 [pdf, other]

doi 10.1039/D0CP00305K

Are 2D fingerprints still valuable for drug discovery?

Authors: Kaifu Gao, Duc Duy Nguyen, Vishnu Sresht, Alan M. Mathiowetz, Meihua Tu, Guo-Wei Wei

Abstract: Recently, molecular fingerprints extracted from three-dimensional (3D) structures using advanced mathematics, such as algebraic topology, differential geometry, and graph theory have been paired with efficient machine learning, especially deep learning algorithms to outperform other methods in drug discovery applications and competitions. This raises the question of whether classical 2D fingerprints are still valuable in computer-aided drug discovery. This work considers 23 datasets associated with four typical problems, namely protein-ligand binding, toxicity, solubility and partition coefficient to assess the performance of eight 2D fingerprints. Advanced machine learning algorithms including random forest, gradient boosted decision tree, single-task deep neural network and multitask deep neural network are employed to construct efficient 2D-fingerprint-based models. Additionally, appropriate consensus models are built to further enhance the performance of 2D-fingerprint-based methods. It is demonstrated that 2D-fingerprint-based models perform as well as the state-of-the-art 3D structure-based models for the predictions of toxicity, solubility, partition coefficient and protein-ligand binding affinity based on only ligand information. However, 3D structure-based models outperform 2D fingerprint-based methods in complex-based protein-ligand binding affinity predictions.

Submitted 3 November, 2019; originally announced November 2019.

arXiv:1911.00484 [pdf, other]

Select, Answer and Explain: Interpretable Multi-hop Reading Comprehension over Multiple Documents

Authors: Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, Bowen Zhou

Abstract: Interpretable multi-hop reading comprehension (RC) over multiple documents is a challenging problem because it demands reasoning over multiple information sources and explaining the answer prediction by providing supporting evidences. In this paper, we propose an effective and interpretable Select, Answer and Explain (SAE) system to solve the multi-document RC problem. Our system first filters out answer-unrelated documents and thus reduce the amount of distraction information. This is achieved by a document classifier trained with a novel pairwise learning-to-rank loss. The selected answer-related documents are then input to a model to jointly predict the answer and supporting sentences. The model is optimized with a multi-task learning objective on both token level for answer prediction and sentence level for supporting sentences prediction, together with an attention-based interaction between these two tasks. Evaluated on HotpotQA, a challenging multi-hop RC data set, the proposed SAE system achieves top competitive performance in distractor setting compared to other existing systems on the leaderboard.

Submitted 10 February, 2020; v1 submitted 1 November, 2019; originally announced November 2019.

Comments: Accepted to AAAI 2020

arXiv:1906.04881 [pdf, other]

Multiple instance learning with graph neural networks

Authors: Ming Tu, Jing Huang, Xiaodong He, Bowen Zhou

Abstract: Multiple instance learning (MIL) aims to learn the mapping between a bag of instances and the bag-level label. In this paper, we propose a new end-to-end graph neural network (GNN) based algorithm for MIL: we treat each bag as a graph and use GNN to learn the bag embedding, in order to explore the useful structural information among instances in bags. The final graph representation is fed into a classifier for label prediction. Our algorithm is the first attempt to use GNN for MIL. We empirically show that the proposed algorithm achieves the state of the art performance on several popular MIL data sets without losing model interpretability.

Submitted 11 June, 2019; originally announced June 2019.

Comments: Accepted to ICML 2019 Workshop on Learning and Reasoning with Graph-Structured Representations

arXiv:1905.07374 [pdf, other]

Multi-hop Reading Comprehension across Multiple Documents by Reasoning over Heterogeneous Graphs

Authors: Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xiaodong He, Bowen Zhou

Abstract: Multi-hop reading comprehension (RC) across documents poses new challenge over single-document RC because it requires reasoning over multiple documents to reach the final answer. In this paper, we propose a new model to tackle the multi-hop RC problem. We introduce a heterogeneous graph with different types of nodes and edges, which is named as Heterogeneous Document-Entity (HDE) graph. The advantage of HDE graph is that it contains different granularity levels of information including candidates, documents and entities in specific document contexts. Our proposed model can do reasoning over the HDE graph with nodes representation initialized with co-attention and self-attention based context encoders. We employ Graph Neural Networks (GNN) based message passing algorithms to accumulate evidences on the proposed HDE graph. Evaluated on the blind test set of the Qangaroo WikiHop data set, our HDE graph based single model delivers competitive result, and the ensemble model achieves the state-of-the-art performance.

Submitted 4 June, 2019; v1 submitted 17 May, 2019; originally announced May 2019.

Comments: To appear in ACL 2019

arXiv:1904.07386 [pdf, other]

I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

Authors: Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen, Hitoshi Yamamoto, Koji Okabe, Ville Vestman, Jing Huang, Guohong Ding, Hanwu Sun, Anthony Larcher, Rohan Kumar Das, Haizhou Li, Mickael Rouvier, Pierre-Michel Bousquet, Wei Rao, Qing Wang, Chunlei Zhang, Fahimeh Bahmaninezhad, Hector Delgado, Jose Patino, Qiongqiong Wang, Ling Guo, Takafumi Koshinaka, Jiacen Zhang, Koichi Shinoda , et al. (21 additional authors not shown)

Abstract: The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the results and lessons learned based on the twelve sub-systems and their fusion submitted to SRE'18. It is also our intention to present a shared view on the advancements, progresses, and major paradigm shifts that we have witnessed as an SRE participant in the past decade from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representation to deep speaker embedding, and a switch of research challenge from channel compensation to domain adaptation.

Submitted 15 April, 2019; originally announced April 2019.

Comments: 5 pages

arXiv:1903.09606 [pdf, other]

Towards adversarial learning of speaker-invariant representation for speech emotion recognition

Authors: Ming Tu, Yun Tang, Jing Huang, Xiaodong He, Bowen Zhou

Abstract: Speech emotion recognition (SER) has attracted great attention in recent years due to the high demand for emotionally intelligent speech interfaces. Deriving speaker-invariant representations for speech emotion recognition is crucial. In this paper, we propose to apply adversarial training to SER to learn speaker-invariant representations. Our model consists of three parts: a representation learning sub-network with time-delay neural network (TDNN) and LSTM with statistical pooling, an emotion classification network and a speaker classification network. Both the emotion and speaker classification network take the output of the representation learning network as input. Two training strategies are employed: one based on domain adversarial training (DAT) and the other one based on cross-gradient training (CGT). Besides the conventional data set, we also evaluate our proposed models on a much larger publicly available emotion data set with 250 speakers. Evaluation results show that on IEMOCAP, DAT and CGT provides 5.6% and 7.4% improvement respectively, over a baseline system without speaker-invariant representation learning on 5-fold cross validation. On the larger emotion data set, while CGT fails to yield better results than baseline, DAT can still provide 9.8% relative improvement on a standalone test set.

Submitted 22 March, 2019; originally announced March 2019.

arXiv:1812.03594

New Perfect Nonlinear Functions over Finite Fields

Authors: Jinquan Luo, Junru Ma, Min Tu

Abstract: In this paper we present a new class of perfect nonlinear %Dembowski-Ostrom polynomials over $\mathbb{F}_{p^{2k}}$ for any odd prime $p$. In addition, we show that the new perfect nonlinear functions are CCZ-inequivalent to all the previously known perfect nonlinear functions in general.

Submitted 3 May, 2019; v1 submitted 9 December, 2018; originally announced December 2018.

Comments: This result is not new. It has been found by other researchers many years ago

arXiv:1812.01834 [pdf, other]

doi 10.1126/sciadv.aau6120

Gate tuning from exciton superfluid to quantum anomalous Hall in van der Waals heterobilayer

Authors: Qizhong Zhu, Matisse Wei-Yuan Tu, Qingjun Tong, Wang Yao

Abstract: Van der Waals heterostructures of 2D materials provide a powerful approach towards engineering various quantum phases of matters. Examples include topological matters such as quantum spin Hall (QSH) insulator, and correlated matters such as exciton superfluid. It can be of great interest to realize these vastly different quantum matters on a common platform, however, their distinct origins tend to restrict them to material systems of incompatible characters. Here we show that heterobilayers of two-dimensional valley semiconductors can be tuned through interlayer bias between an exciton superfluid (ES), a quantum anomalous Hall (QAH) insulator, and a QSH insulator. The tunability between these distinct phases results from the competition of Coulomb interaction with the interlayer quantum tunnelling that has a chiral form in valley semiconductors. Our findings point to exciting opportunities for harnessing both protected topological edge channels and bulk superfluidity in an electrically configurable platform.

Submitted 5 December, 2018; originally announced December 2018.

Comments: To appear in Science Advances

Journal ref: Sci. Adv. 5, eaau6120 (2019)

arXiv:1807.05285 [pdf]

doi 10.1021/acs.nanolett.8b04160

Voltage Control of a van der Waals Spin-Filter Magnetic Tunnel Junction

Authors: Tiancheng Song, Matisse Wei-Yuan Tu, Caitlin Carnahan, Xinghan Cai, Takashi Taniguchi, Kenji Watanabe, Michael A. McGuire, David H. Cobden, Di Xiao, Wang Yao, Xiaodong Xu

Abstract: Atomically thin chromium triiodide (CrI3) has recently been identified as a layered antiferromagnetic insulator, in which adjacent ferromagnetic monolayers are antiferromagnetically coupled. This unusual magnetic structure naturally comprises a series of anti-aligned spin filters which can be utilized to make spin-filter magnetic tunnel junctions with very large tunneling magnetoresistance (TMR). Here we report voltage control of TMR formed by four-layer CrI3 sandwiched by monolayer graphene contacts in a dual-gated structure. By varying the gate voltages at fixed magnetic field, the device can be switched reversibly between bistable magnetic states with the same net magnetization but drastically different resistance (by a factor of ten or more). In addition, without switching the state, the TMR can be continuously modulated between 17,000% and 57,000%, due to the combination of spin-dependent tunnel barrier with changing carrier distributions in the graphene contacts. Our work demonstrates new kinds of magnetically moderated transistor action and opens up possibilities for voltage-controlled van der Waals spintronic devices.

Submitted 13 July, 2018; originally announced July 2018.

arXiv:1807.01738 [pdf, other]

Investigating the role of L1 in automatic pronunciation evaluation of L2 speech

Authors: Ming Tu, Anna Grabek, Julie Liss, Visar Berisha

Abstract: Automatic pronunciation evaluation plays an important role in pronunciation training and second language education. This field draws heavily on concepts from automatic speech recognition (ASR) to quantify how close the pronunciation of non-native speech is to native-like pronunciation. However, it is known that the formation of accent is related to pronunciation patterns of both the target language (L2) and the speaker's first language (L1). In this paper, we propose to use two native speech acoustic models, one trained on L2 speech and the other trained on L1 speech. We develop two sets of measurements that can be extracted from two acoustic models given accented speech. A new utterance-level feature extraction scheme is used to convert these measurements into a fixed-dimension vector which is used as an input to a statistical model to predict the accentedness of a speaker. On a data set consisting of speakers from 4 different L1 backgrounds, we show that the proposed system yields improved correlation with human evaluators compared to systems only using the L2 acoustic model.

Submitted 4 July, 2018; originally announced July 2018.

Comments: To appear in Interspeech 2018

arXiv:1804.10325 [pdf, other]

Simulating dysarthric speech for training data augmentation in clinical speech applications

Authors: Yishan Jiao, Ming Tu, Visar Berisha, Julie Liss

Abstract: Training machine learning algorithms for speech applications requires large, labeled training data sets. This is problematic for clinical applications where obtaining such data is prohibitively expensive because of privacy concerns or lack of access. As a result, clinical speech applications are typically developed using small data sets with only tens of speakers. In this paper, we propose a method for simulating training data for clinical applications by transforming healthy speech to dysarthric speech using adversarial training. We evaluate the efficacy of our approach using both objective and subjective criteria. We present the transformed samples to five experienced speech-language pathologists (SLPs) and ask them to identify the samples as healthy or dysarthric. The results reveal that the SLPs identify the transformed speech as dysarthric 65% of the time. In a pilot classification experiment, we show that by using the simulated speech samples to balance an existing dataset, the classification accuracy improves by about 10% after data augmentation.

Submitted 26 April, 2018; originally announced April 2018.

Comments: Will appear in Proc. of ICASSP 2018

arXiv:1804.08663 [pdf, other]

A Discriminative Acoustic-Prosodic Approach for Measuring Local Entrainment

Authors: Megan M. Willi, Stephanie A. Borrie, Tyson S. Barrett, Ming Tu, Visar Berisha

Abstract: Acoustic-prosodic entrainment describes the tendency of humans to align or adapt their speech acoustics to each other in conversation. This alignment of spoken behavior has important implications for conversational success. However, modeling the subtle nature of entrainment in spoken dialogue continues to pose a challenge. In this paper, we propose a straightforward definition for local entrainment in the speech domain and operationalize an algorithm based on this: acoustic-prosodic features that capture entrainment should be maximally different between real conversations involving two partners and sham conversations generated by randomly mixing the speaking turns from the original two conversational partners. We propose an approach for measuring local entrainment that quantifies alignment of behavior on a turn-by-turn basis, projecting the differences between interlocutors' acoustic-prosodic features for a given turn onto a discriminative feature subspace that maximizes the difference between real and sham conversations. We evaluate the method using the derived features to drive a classifier aiming to predict an objective measure of conversational success (i.e., low versus high), on a corpus of task-oriented conversations. The proposed entrainment approach achieves 72% classification accuracy using a Naive Bayes classifier, outperforming three previously established approaches evaluated on the same conversational corpus.

Submitted 12 July, 2018; v1 submitted 23 April, 2018; originally announced April 2018.

arXiv:1801.08679 [pdf]

doi 10.1126/science.aar4851

Giant Tunneling Magnetoresistance in Spin-Filter van der Waals Heterostructures

Authors: Tiancheng Song, Xinghan Cai, Matisse Wei-Yuan Tu, Xiaoou Zhang, Bevin Huang, Nathan P. Wilson, Kyle L. Seyler, Lin Zhu, Takashi Taniguchi, Kenji Watanabe, Michael A. McGuire, David H. Cobden, Di Xiao, Wang Yao, Xiaodong Xu

Abstract: Magnetic multilayer devices that exploit magnetoresistance are the backbone of magnetic sensing and data storage technologies. Here we report novel multiple-spin-filter magnetic tunnel junctions (sf-MTJs) based on van der Waals (vdW) heterostructures in which atomically thin chromium triiodide (CrI3) acts as a spin-filter tunnel barrier sandwiched between graphene contacts. We demonstrate tunneling magnetoresistance which is drastically enhanced with increasing CrI3 layer thickness, reaching a record 19,000% for magnetic multilayer structures using four-layer sf-MTJs at low temperatures. These devices also show multiple resistance states as a function of magnetic field, suggesting the potential for multi-bit functionalities using an individual vdW sf-MTJ. Using magnetic circular dichroism measurements, we attribute these effects to the intrinsic layer-by-layer antiferromagnetic ordering of the atomically thin CrI3. Our work reveals the possibility to push magnetic information storage to the atomically thin limit, and highlights CrI3 as a superlative magnetic tunnel barrier for vdW heterostructure spintronic devices.

Submitted 26 January, 2018; originally announced January 2018.

Comments: Submitted

arXiv:1712.05370 [pdf, other]

Stabilizing a high-pressure phase in InSb at ambient conditions with a laser-driven pressure pulse

Authors: A. Jarnac, Xiaocui Wang, A. U. J. Bengtsson, M. Burza, J. C. Ekstrom, H. Enquist, A. Jurgilaitis, N. Kretzschmar, A. I. H. Persson, C. M. Tu, M. Wulff, F. Dorchies, J. Larsson

Abstract: In this letter, we describe the stabilization of indium antimonide (InSb) in the high-pressure orthorhombic phase (InSb-III) at ambient conditions. Until now, InSb-III has only been observed above 9 GPa, or at around 3 GPa as a metastable structure during the phase transition from cubic zinc blende (InSb-I) to orthorhombic InSb-IV. The crystalline phase transition from InSb-I to InSb-III was driven by an ultrashort, laser-generated, non-hydrostatic pressure pulse. The transition occurred in preferred orientations locked to the initial orientation of the InSb-I crystal, breaking the symmetry of the InSb-I cubic cell to form the InSb-III orthorhombic cell.

Submitted 14 December, 2017; originally announced December 2017.

arXiv:1706.04813 [pdf, other]

doi 10.1088/2053-1583/aa71fc

Switchable valley functionalities of an $n-n^{-}-n$ junction in 2D semiconductors

Authors: Matisse Wei-Yuan Tu, Wang Yao

Abstract: We show that an $n-n^{-}-n$ junction in 2D semiconductors can flexibly realize two basic valleytronic functions, i.e. valley filter and valley source, with gate controlled switchability between the two. Upon carrier flux passing through the junction, the valley filter and valley source functions are enabled respectively by intra- and inter-valley scatterings, and the two functions dominate respectively at small and large band-offset between the $n$ and $n^{-}$ regions. It can be generally shown that the valley filter effect has an angular dependent polarity and vanishes under angular integration, by the same constraint from time-reversal symmetry that leads to its absence in one-dimension. These findings are demonstrated for monolayer transition metal dichalcogenides and graphene using tight-binding calculations. We further show that junction along chiral directions can concentrate the valley pump in an angular interval largely separated from the bias direction, allowing efficient harvest of valley polarization in a cross-bar device.

Submitted 15 June, 2017; originally announced June 2017.

Journal ref: 2D Mater. 4 (2017) 025109
