Showing 1–48 of 48 results for author: Dang, J

  1. arXiv:2407.02552  [pdf, other]

    cs.CL cs.AI cs.LG

    RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs

    Authors: John Dang, Arash Ahmadian, Kelly Marchisio, Julia Kreutzer, Ahmet Üstün, Sara Hooker

    Abstract: Preference optimization techniques have become a standard final stage for training state-of-the-art large language models (LLMs). However, despite widespread adoption, the vast majority of work to date has focused on first-class citizen languages like English and Chinese. This captures a small fraction of the languages in the world, but also makes it unclear which aspects of current state-of-the-art r…

    Submitted 2 July, 2024; originally announced July 2024.

  2. arXiv:2407.00743  [pdf, other]

    cs.MM cs.AI cs.CL eess.AS

    AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

    Authors: Sheng Wu, Jiaxing Liu, Longbiao Wang, Dongxiao He, Xiaobao Wang, Jianwu Dang

    Abstract: Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there exists a dearth of investigation into effective multimodal fusion methods. We propose a novel framework called AIMDiT to solve the problem of multimodal fusion…

    Submitted 12 April, 2024; originally announced July 2024.

  3. arXiv:2406.08911  [pdf, other]

    cs.CL eess.AS

    An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

    Authors: Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

    Abstract: Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  4. arXiv:2405.15032  [pdf, other]

    cs.CL

    Aya 23: Open Weight Releases to Further Multilingual Progress

    Authors: Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

    Abstract: This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-the-art language modelin…

    Submitted 31 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  5. arXiv:2404.11129  [pdf, other]

    cs.CV

    Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales

    Authors: Minghe Gao, Shuang Chen, Liang Pang, Yuan Yao, Jisheng Dang, Wenqiao Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang, Tat-Seng Chua

    Abstract: The remarkable performance of Multimodal Large Language Models (MLLMs) has unequivocally demonstrated their proficient understanding capabilities in handling a wide array of visual tasks. Nevertheless, the opaque nature of their black-box reasoning processes persists as an enigma, rendering them uninterpretable and struggling with hallucination. Their ability to execute intricate compositional rea…

    Submitted 17 April, 2024; originally announced April 2024.

  6. arXiv:2404.03216  [pdf, other]

    cs.CR

    Accurate Low-Degree Polynomial Approximation of Non-polynomial Operators for Fast Private Inference in Homomorphic Encryption

    Authors: Jianming Tong, Jingtian Dang, Anupam Golder, Callie Hao, Arijit Raychowdhury, Tushar Krishna

    Abstract: As machine learning (ML) permeates fields like healthcare, facial recognition, and blockchain, the need to protect sensitive data intensifies. Fully Homomorphic Encryption (FHE) allows inference on encrypted data, preserving the privacy of both data and the ML model. However, it slows down non-secure inference by up to five orders of magnitude, with a root cause of replacing non-polynomial operators (ReLU…

    Submitted 7 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: Proceedings of the 5th MLSys Conference, Santa Clara, CA, USA, 2024. Copyright 2024 by the author(s)

  7. arXiv:2401.02081  [pdf, ps, other]

    cs.IT eess.SP

    Performance Trade-off and Joint Waveform Design for MIMO-OFDM DFRC Systems

    Authors: Tianchen Liu, Liang Wu, Bo An, Zaichen Zhang, Jian Dang, Jiangzhou Wang

    Abstract: Dual-functional radar-communication (DFRC) has attracted considerable attention. This paper considers the frequency-selective multipath fading environment and proposes DFRC waveform design strategies based on multiple-input and multiple-output (MIMO) and orthogonal frequency division multiplexing (OFDM) techniques. In the proposed waveform design strategies, the Cramer-Rao bound (CRB) of the radar…

    Submitted 4 January, 2024; originally announced January 2024.

  8. arXiv:2312.14398  [pdf, other]

    cs.SD eess.AS

    ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

    Authors: Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi

    Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. In most cases, TTS systems are built using a single speaker's voice. However, there is growing interest in developing systems that can synthesize voices…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: 13 pages, 5 figures

  9. arXiv:2312.11201  [pdf, other]

    eess.AS cs.SD eess.SP

    A Refining Underlying Information Framework for Monaural Speech Enhancement

    Authors: Rui Cao, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang

    Abstract: Supervised speech enhancement has gained significantly from recent advancements in neural networks, especially due to their ability to non-linearly fit the diverse representations of target speech, such as waveform or spectrum. However, these direct-fitting solutions continue to face challenges with degraded speech and residual noise in hearing evaluations. By bridging the speech enhancement and t…

    Submitted 24 December, 2023; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: 5 pages

  10. arXiv:2312.07032  [pdf, ps, other]

    cs.LG stat.ML

    Ahpatron: A New Budgeted Online Kernel Learning Machine with Tighter Mistake Bound

    Authors: Yun Liao, Junfan Li, Shizhong Liao, Qinghua Hu, Jianwu Dang

    Abstract: In this paper, we study the mistake bound of online kernel learning on a budget. We propose a new budgeted online kernel learning model, called Ahpatron, which significantly improves the mistake bound of previous work and resolves the open problem posed by Dekel, Shalev-Shwartz, and Singer (2005). We first present an aggressive variant of Perceptron, named AVP, a model without budget, which uses a…

    Submitted 12 December, 2023; originally announced December 2023.

  11. arXiv:2310.11523  [pdf, other]

    cs.LG cs.AI cs.CL

    Group Preference Optimization: Few-Shot Alignment of Large Language Models

    Authors: Siyan Zhao, John Dang, Aditya Grover

    Abstract: Many applications of large language models (LLMs), ranging from chatbots to creative writing, require nuanced subjective judgments that can differ significantly across different groups. Existing alignment algorithms can be expensive to align for each group, requiring prohibitive amounts of group-specific preference data and computation for real-world use cases. We introduce Group Preference Optimi…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: 24 pages, 12 figures

  12. arXiv:2309.15512  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

    Authors: Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang

    Abstract: Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic & acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from inform…

    Submitted 18 December, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024. arXiv admin note: substantial text overlap with arXiv:2307.15484; text overlap with arXiv:2309.00424

  13. arXiv:2309.00424  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Learning Speech Representation From Contrastive Token-Acoustic Pretraining

    Authors: Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang

    Abstract: For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content is emphasized, while the paralinguistic informati…

    Submitted 18 December, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

    Comments: Accepted by ICASSP 2024

  14. arXiv:2308.15812  [pdf, other]

    cs.LG cs.AI cs.CL

    Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models

    Authors: Hritik Bansal, John Dang, Aditya Grover

    Abstract: Aligning large language models (LLMs) with human values and intents critically involves the use of human or AI feedback. While dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score Response A on a scale of 1-7) and rankings (e.g., is Response A better than Response B?). In this work, we analyze the effect…

    Submitted 5 February, 2024; v1 submitted 30 August, 2023; originally announced August 2023.

    Comments: 31 pages, Accepted to ICLR 2024

  15. arXiv:2307.15484  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

    Authors: Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang

    Abstract: Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging pr…

    Submitted 18 December, 2023; v1 submitted 28 July, 2023; originally announced July 2023.

    Comments: Accepted by ICASSP 2024

  16. arXiv:2307.06657  [pdf, other]

    cs.IT eess.SP

    Downlink Precoding for Cell-free FBMC/OQAM Systems With Asynchronous Reception

    Authors: Yuhao Qi, Jian Dang, Zaichen Zhang, Liang Wu, Yongpeng Wu

    Abstract: In this work, an efficient precoding design scheme is proposed for downlink cell-free distributed massive multiple-input multiple-output (DM-MIMO) filter bank multi-carrier (FBMC) systems with asynchronous reception and high frequency selectivity. The proposed scheme includes a multiple interpolation structure to eliminate the impact of response difference we recently discovered, which has bette…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: 16 pages, 4 figures

  17. arXiv:2306.02625  [pdf, other]

    cs.SD eess.AS

    Rethinking the visual cues in audio-visual speaker extraction

    Authors: Junjie Li, Meng Ge, Zexu Pan, Rui Cao, Longbiao Wang, Jianwu Dang, Shiliang Zhang

    Abstract: The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visual cue contributes more to the speaker extraction p…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Accepted in Interspeech 2023

  18. arXiv:2305.17860  [pdf, other]

    cs.SD eess.AS

    Speech and noise dual-stream spectrogram refine network with speech distortion loss for robust speech recognition

    Authors: Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang, Xiaobao Wang, Shiliang Zhang

    Abstract: In recent years, the joint training of speech enhancement front-end and automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods only use enhanced speech as input for the back-end. However, it is difficult for speech enhancement systems to directly separate speech from input due to the diverse types of noise with d…

    Submitted 30 May, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

  19. arXiv:2303.14593  [pdf, other]

    cs.SD eess.AS

    Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

    Authors: Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang, Tatsuya Kawahara

    Abstract: Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information,…

    Submitted 25 March, 2023; originally announced March 2023.

  20. arXiv:2303.05070  [pdf, other]

    cs.IT

    Pilot-Free Unsourced Random Access Via Dictionary Learning and Error-Correcting Codes

    Authors: Zhentian Zhang, Jian Dang, Zaichen Zhang, Liang Wu, Bingcheng Zhu, Lei Wang

    Abstract: Massive machine-type communications (mMTC) or massive access is a critical scenario in the fifth generation (5G) and the future cellular network. With the surging density of devices from millions to billions, unique pilot allocation becomes inapplicable in the user ID-incorporated grant-free random access protocol. Unsourced random access (URA) manifests itself by focusing only on unwrapping the r…

    Submitted 9 March, 2023; originally announced March 2023.

  21. arXiv:2302.11254  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS eess.IV

    Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

    Authors: Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang

    Abstract: Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-mod…

    Submitted 22 February, 2023; originally announced February 2023.

  22. arXiv:2302.09208  [pdf, other]

    cs.CV

    Bridge Damage Cause Estimation Using Multiple Images Based on Visual Question Answering

    Authors: Tatsuro Yamane, Pang-jo Chun, Ji Dang, Takayuki Okatani

    Abstract: In this paper, a bridge member damage cause estimation framework is proposed by calculating the image position using Structure from Motion (SfM) and acquiring its information via Visual Question Answering (VQA). For this, a VQA model was developed that uses bridge images for dataset creation and outputs the damage or member name and its existence based on the images and questions. In the developed…

    Submitted 17 February, 2023; originally announced February 2023.

  23. arXiv:2212.03401  [pdf, other]

    eess.AS cs.LG cs.SD

    MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation

    Authors: Yanjie Fu, Haoran Yin, Meng Ge, Longbiao Wang, Gaoyan Zhang, Jianwu Dang, Chengyun Deng, Fei Wang

    Abstract: Recently, many deep learning based beamformers have been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker feature, face image or directional information. In this paper, we propose an end-to-end beamforming network for direction guided speech separation given merely the mixture signal, namely MIMO-DBnet. Specifically, we d…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  24. arXiv:2211.01046  [pdf, other]

    eess.AS cs.CL cs.SD

    Monolingual Recognizers Fusion for Code-switching Speech Recognition

    Authors: Tongtong Song, Qiang Xu, Haoyu Lu, Longbiao Wang, Hao Shi, Yuqin Lin, Yanbing Yang, Jianwu Dang

    Abstract: The bi-encoder structure has been intensively investigated in code-switching (CS) automatic speech recognition (ASR). However, most existing methods require that the structures of two monolingual ASR models (MAMs) be the same and only use the encoder of MAMs. This leads to the problem that pre-trained MAMs cannot be timely and fully used for CS ASR. In this paper, we propose a monolingual recogn…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  25. arXiv:2210.10401  [pdf, other]

    cs.IT eess.SP

    Asynchronous RIS-assisted Localization: A Comprehensive Analysis of Fundamental Limits

    Authors: Ziyi Gong, Liang Wu, Zaichen Zhang, Jian Dang, Yongpeng Wu, Jiangzhou Wang

    Abstract: The reconfigurable intelligent surface (RIS) has drawn considerable attention for its ability to enhance the performance of not only the wireless communication but also the indoor localization at low cost. This paper investigates the performance limits of the RIS-based near-field localization in the asynchronous scenario, and analyzes the impact of each part of the cascaded channel on the locali…

    Submitted 26 March, 2023; v1 submitted 19 October, 2022; originally announced October 2022.

  26. arXiv:2210.06177  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS

    VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

    Authors: Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang

    Abstract: Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previou…

    Submitted 9 October, 2022; originally announced October 2022.

  27. arXiv:2210.05254  [pdf, other]

    cs.SD cs.AI eess.AS

    Deep Spectro-temporal Artifacts for Detecting Synthesized Speech

    Authors: Xiaohui Liu, Meng Liu, Lin Zhang, Linjuan Zhang, Chang Zeng, Kai Li, Nan Li, Kong Aik Lee, Longbiao Wang, Jianwu Dang

    Abstract: The Audio Deep Synthesis Detection (ADD) Challenge has been held to detect generated human-like speech. With our submitted system, this paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection). In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding featur…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: 7 pages, 1 figure, accepted by the Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia

  28. MIMO-DoAnet: Multi-channel Input and Multiple Outputs DoA Network with Unknown Number of Sound Sources

    Authors: Haoran Yin, Meng Ge, Yanjie Fu, Gaoyan Zhang, Longbiao Wang, Lei Zhang, Lin Qiu, Jianwu Dang

    Abstract: Recent neural network based Direction of Arrival (DoA) estimation algorithms have performed well in scenarios with an unknown number of sound sources. These algorithms are usually achieved by mapping the multi-channel audio input to the single output (i.e. overall spatial pseudo-spectrum (SPS) of all sources), which is called MISO. However, such MISO algorithms strongly depend on empirical threshold settin…

    Submitted 16 November, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech 2022

  29. arXiv:2206.14580  [pdf, other]

    cs.CL eess.AS

    Language-specific Characteristic Assistance for Code-switching Speech Recognition

    Authors: Tongtong Song, Qiang Xu, Meng Ge, Longbiao Wang, Hao Shi, Yongjie Lv, Yuqin Lin, Jianwu Dang

    Abstract: Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, most existing methods have no language constraints on LSEs and underutili…

    Submitted 11 July, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted by Interspeech 2022

  30. arXiv:2206.12273  [pdf, other]

    eess.AS cs.LG

    Iterative Sound Source Localization for Unknown Number of Sources

    Authors: Yanjie Fu, Meng Ge, Haoran Yin, Xinyuan Qian, Longbiao Wang, Gaoyan Zhang, Jianwu Dang

    Abstract: Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of an unknown number of sources, existing localization algorithms attempt to predict a likelihood-based coding (i.e., spatial spectrum) and employ a pre-determined threshold to detect the source number and corresponding DOA value. However, these t…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: Accepted by Interspeech 2022

  31. arXiv:2205.00256  [pdf, other]

    cs.LG

    Heterogeneous Graph Neural Networks using Self-supervised Reciprocally Contrastive Learning

    Authors: Cuiying Huo, Dongxiao He, Yawen Li, Di Jin, Jianwu Dang, Weixiong Zhang, Witold Pedrycz, Lingfei Wu

    Abstract: Heterogeneous graph neural network (HGNN) is a very popular technique for the modeling and analysis of heterogeneous graphs. Most existing HGNN-based approaches are supervised or semi-supervised learning methods requiring graphs to be annotated, which is costly and time-consuming. Self-supervised contrastive learning has been proposed to address the problem of requiring annotated data by mining in…

    Submitted 16 November, 2023; v1 submitted 30 April, 2022; originally announced May 2022.

  32. arXiv:2203.09098  [pdf, other]

    cs.SD cs.LG eess.AS

    TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding

    Authors: Ruiteng Zhang, Jianguo Wei, Xugang Lu, Wenhuan Lu, Di Jin, Junhai Xu, Lin Zhang, Yantao Ji, Jianwu Dang

    Abstract: Speaker embedding is an important front-end module to explore discriminative speaker features for many speech applications where speaker information is needed. Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation. However, naively adding many branches of multi-scale f…

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file

  33. arXiv:2202.09995  [pdf, other]

    eess.AS cs.SD

    L-SpEx: Localized Target Speaker Extraction

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., face image or video. In this…

    Submitted 21 February, 2022; originally announced February 2022.

    Comments: Accepted in ICASSP 2022

  34. arXiv:2201.11893  [pdf, ps, other]

    cs.IT

    A New High Energy Efficiency Scheme Based on Two-Dimension Resource Blocks in Wireless Communication Systems

    Authors: Kang Liu, Zaichen Zhang, Jian Dang, Liang Wu, Bingchen Zhu, Lei Wang, Chuan Zhang

    Abstract: Energy efficiency (EE) plays a key role in future wireless communication networks, and it is easy to achieve high EE performance in the low SNR regime. In this paper, a new high EE scheme is proposed for a MIMO wireless communication system working in the low SNR regime by using two dimension resource allocation. First, we define the high EE area based on the relationship between the transmission powe…

    Submitted 27 January, 2022; originally announced January 2022.

  35. arXiv:2110.04451  [pdf, other]

    cs.SD cs.AI eess.AS

    Using multiple reference audios and style embedding constraints for speech synthesis

    Authors: Cheng Gong, Longbiao Wang, Zhenhua Ling, Ju Zhang, Jianwu Dang

    Abstract: The end-to-end speech synthesis model can directly take an utterance as reference audio, and generate speech from the text with prosody and speaker characteristics similar to the reference audio. However, an appropriate acoustic embedding must be manually selected during inference. Due to the fact that only the matched text and speech are used in the training process, using unmatched text and spee…

    Submitted 9 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, submitted to ICASSP 2022

  36. arXiv:2104.08510  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Exploring Deep Learning for Joint Audio-Visual Lip Biometrics

    Authors: Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Chang Zeng, Jianwu Dang

    Abstract: Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication. Previous works have demonstrated the usefulness of AV lip biometrics. However, the lack of a sizeable AV database hinders the exploration of deep-learning-based audio-visual lip biometrics. To address this problem, we compile a modera…

    Submitted 17 April, 2021; originally announced April 2021.

  37. arXiv:2011.09624  [pdf, other]

    eess.AS cs.LG

    Multi-stage Speaker Extraction with Utterance and Frame-Level Reference Signals

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction requires a sample speech from the target speaker as the reference. However, enrolling a speaker with a long speech is not practical. We propose a speaker extraction technique that performs in multiple stages to take full advantage of a short reference speech sample. The extracted speech in early stages is used as the reference speech for late stages. For the first time, we use fr…

    Submitted 2 April, 2021; v1 submitted 18 November, 2020; originally announced November 2020.

    Comments: Accepted in ICASSP 2021

  38. arXiv:2010.05530  [pdf, other]

    cs.IT eess.SP

    Transmit Covariance and Waveform Optimization for Non-orthogonal CP-FBMA System

    Authors: Yuhao Qi, Jian Dang, Zaichen Zhang, Liang Wu, Yongpeng Wu

    Abstract: Filter bank multiple access (FBMA) without subbands orthogonality has been proposed as a new candidate waveform to better meet the requirements of future wireless communication systems and scenarios. It has the ability to process directly the complex symbols without any fancy preprocessing. Along with the usage of cyclic prefix (CP) and wide-banded subband design, CP-FBMA can further improve the p…

    Submitted 13 October, 2020; v1 submitted 12 October, 2020; originally announced October 2020.

    Comments: 30 pages, 9 figures, accepted for publication in the IEEE Transactions on Communications

  39. arXiv:2005.04686  [pdf, other]

    eess.AS cs.SD

    SpEx+: A Complete Time Domain Speaker Extraction Network

    Authors: Meng Ge, Chenglin Xu, Longbiao Wang, Eng Siong Chng, Jianwu Dang, Haizhou Li

    Abstract: Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation in frequency-domain approaches. Unfortunately, SpEx is not fully a time-domain solution since it performs time-domain speech encoding for speaker extraction, while taking frequency-…

    Submitted 17 August, 2020; v1 submitted 10 May, 2020; originally announced May 2020.

    Comments: Accepted in INTERSPEECH 2020

  40. Constructing Accurate and Efficient Deep Spiking Neural Networks with Double-threshold and Augmented Schemes

    Authors: Qiang Yu, Chenxiang Ma, Shiming Song, Gaoyan Zhang, Jianwu Dang, Kay Chen Tan

    Abstract: Spiking neural networks (SNNs) are considered a potential candidate to overcome current challenges such as the high power consumption encountered by artificial neural networks (ANNs); however, there is still a gap between them with respect to the recognition accuracy on practical tasks. A conversion strategy was thus introduced recently to bridge this gap by mapping a trained ANN to an SNN. Howe…

    Submitted 5 May, 2020; originally announced May 2020.

    Comments: 13 pages

  41. Towards Efficient Processing and Learning with Spikes: New Approaches for Multi-Spike Learning

    Authors: Qiang Yu, Shenglan Li, Huajin Tang, Longbiao Wang, Jianwu Dang, Kay Chen Tan

    Abstract: Spikes are the currency in central nervous systems for information transmission and processing. They are also believed to play an essential role in low-power consumption of the biological systems, whose efficiency attracts increasing attention to the field of neuromorphic computing. However, efficient processing and learning of discrete spikes remains a challenging problem. In this paper…

    Submitted 2 May, 2020; originally announced May 2020.

    Comments: 13 pages

  42. arXiv:2004.05772  [pdf, ps, other]

    cs.IT eess.SP

    Joint User Identification and Channel Estimation Over Rician Fading Channels

    Authors: Liang Wu, Zaichen Zhang, Jian Dang, Yongpeng Wu, Huaping Liu, Jiangzhou Wang

    Abstract: This paper considers crowded massive multiple input multiple output (MIMO) communications over a Rician fading channel, where the number of users is much greater than the number of available pilot sequences. A joint user identification and line-of-sight (LOS) component derivation algorithm is proposed without requiring a threshold. Based on the derived LOS component, we design a LOS-only channel e…

    Submitted 13 April, 2020; originally announced April 2020.

  43. Relation Modeling with Graph Convolutional Networks for Facial Action Unit Detection

    Authors: Zhilei Liu, Jiahui Dong, Cuicui Zhang, Longbiao Wang, Jianwu Dang

    Abstract: Most existing AU detection works considering AU relationships are relying on probabilistic graphical models with manually extracted features. This paper proposes an end-to-end deep learning framework for facial AU detection with graph convolutional network (GCN) for AU relation modeling, which has not been explored before. In particular, AU related regions are extracted firstly, latent representat…

    Submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted by MMM2020

  44. arXiv:1902.01094  [pdf, other]

    cs.NE

    Robust Environmental Sound Recognition with Sparse Key-point Encoding and Efficient Multi-spike Learning

    Authors: Qiang Yu, Yanli Yao, Longbiao Wang, Huajin Tang, Jianwu Dang, Kay Chen Tan

    Abstract: The capability for environmental sound recognition (ESR) can determine the fitness of individuals in a way to avoid dangers or pursue opportunities when critical sound events occur. It still remains mysterious about the fundamental principles of biological systems that result in such a remarkable ability. Additionally, the practical importance of ESR has attracted an increasing amount of research…

    Submitted 4 February, 2019; originally announced February 2019.

    Comments: 13 pages, 12 figures

  45. arXiv:1812.08869  [pdf, ps, other]

    cs.IT

    Data-Rate Driven Transmission Strategy for Deep Learning Based Communication Systems

    Authors: Xiao Chen, Julian Cheng, Zaichen Zhang, Liang Wu, Jian Dang

    Abstract: Deep learning (DL) based autoencoder is a promising architecture to implement end-to-end communication systems. One fundamental problem of such systems is how to increase the transmission rate. Two new schemes are proposed to address the limited data rate issue: adaptive transmission scheme and generalized data representation (GDR) scheme. In the first scheme, an adaptive transmission is designed…

    Submitted 28 April, 2020; v1 submitted 20 December, 2018; originally announced December 2018.

    Comments: Published

  46. Speech Emotion Recognition Considering Local Dynamic Features

    Authors: Haotian Guan, Zhilei Liu, Longbiao Wang, Jianwu Dang, Ruiguo Yu

    Abstract: Recently, increasing attention has been directed to the study of the speech emotion recognition, in which global acoustic features of an utterance are mostly used to eliminate the content differences. However, the expression of speech emotion is a dynamic process, which is reflected through dynamic durations, energies, and some other prosodic information when one speaks. In this paper, a novel loc…

    Submitted 20 March, 2018; originally announced March 2018.

    Comments: 10 pages, 3 figures, accepted by ISSP 2017

  47. arXiv:1101.3124  [pdf]

    cs.CR cs.CV cs.HC

    SafeVchat: Detecting Obscene Content and Misbehaving Users in Online Video Chat Services

    Authors: Xinyu Xing, Yu-Li Liang, Hanqiang Cheng, Jianxun Dang, Sui Huang, Richard Han, Xue Liu, Qin Lv, Shivakant Mishra

    Abstract: Online video chat services such as Chatroulette, Omegle, and vChatter that randomly match pairs of users in video chat sessions are fast becoming very popular, with over a million users per month in the case of Chatroulette. A key problem encountered in such systems is the presence of flashers and obscene content. This problem is especially acute given the presence of underage minors in such syste…

    Submitted 17 January, 2011; originally announced January 2011.

    Comments: The 20th International World Wide Web Conference (WWW 2011)

  48. arXiv:1007.1473  [pdf]

    cs.CR cs.NI

    Intrusions into Privacy in Video Chat Environments: Attacks and Countermeasures

    Authors: Xinyu Xing, Jianxun Dang, Richard Han, Xue Liu, Shivakant Mishra

    Abstract: Video chat systems such as Chatroulette have become increasingly popular as a way to meet and converse one-on-one via video and audio with other users online in an open and interactive manner. At the same time, security and privacy concerns inherent in such communication have been little explored. This paper presents one of the first investigations of the privacy threats found in such video chat s…

    Submitted 8 July, 2010; originally announced July 2010.

    Comments: 8 pages, submitted to WPES

    Report number: Technical Report CU-CS 1069-10