Showing 1–50 of 131 results for author: Su, D

  1. arXiv:2406.11704  [pdf, other]

    cs.CL cs.AI cs.LG

    Nemotron-4 340B Technical Report

    Authors: Nvidia, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, et al. (58 additional authors not shown)

    Abstract: We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation be…

    Submitted 17 June, 2024; originally announced June 2024.

  2. arXiv:2406.11263  [pdf, other]

    cs.CL cs.AI

    The Fall of ROME: Understanding the Collapse of LLMs in Model Editing

    Authors: Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Du Su, Dawei Yin, Huawei Shen

    Abstract: Despite significant progress in model editing methods, their application in real-world scenarios remains challenging as they often cause large language models (LLMs) to collapse. Among them, ROME is particularly concerning, as it could disrupt LLMs with only a single edit. In this paper, we study the root causes of such collapse. Through extensive analysis, we identify two primary factors that con…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2406.04350  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Prompt-guided Precise Audio Editing with Diffusion Models

    Authors: Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su, Wei Liang, Dong Yu

    Abstract: Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a flexible and precise way to modify target events within an audio track. We present a novel approach, referred to as PPAE, which serves as a general module for diffusi…

    Submitted 11 May, 2024; originally announced June 2024.

    Comments: Accepted by ICML 2024

  4. arXiv:2406.00976  [pdf, other]

    cs.CL cs.SD eess.AS

    Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer

    Authors: Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu

    Abstract: While recent advancements in speech language models have achieved significant progress, they face remarkable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce Generative Pre-trained Speech Transformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio wavef…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 (main conference)

  5. arXiv:2405.19813  [pdf, other]

    cs.RO

    SLAM-based Joint Calibration of Multiple Asynchronous Microphone Arrays and Sound Source Localization

    Authors: Jiang Wang, Yuanzheng He, Daobilige Su, Katsutoshi Itoyama, Kazuhiro Nakadai, Junfeng Wu, Shoudong Huang, Youfu Li, He Kong

    Abstract: Robot audition systems with multiple microphone arrays have many applications in practice. However, accurate calibration of multiple microphone arrays remains challenging because there are many unknown parameters to be identified, including the relative transforms (i.e., orientation, translation) and asynchronous factors (i.e., initial time offset and sampling clock difference) between microphone…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Accepted for publication in the IEEE Transactions on Robotics

  6. arXiv:2405.10210  [pdf, other]

    cs.LG cs.SE

    GPT Store Mining and Analysis

    Authors: Dongxun Su, Yanjie Zhao, Xinyi Hou, Shenao Wang, Haoyu Wang

    Abstract: As a pivotal extension of the renowned ChatGPT, the GPT Store serves as a dynamic marketplace for various Generative Pre-trained Transformer (GPT) models, shaping the frontier of conversational AI. This paper presents an in-depth measurement study of the GPT Store, with a focus on the categorization of GPTs by topic, factors influencing GPT popularity, and the potential security risks. Our investi…

    Submitted 16 May, 2024; originally announced May 2024.

  7. arXiv:2404.09760  [pdf, other]

    cs.LG cs.AI

    Effective Reinforcement Learning Based on Structural Information Principles

    Authors: Xianghua Zeng, Hao Peng, Dingli Su, Angsheng Li

    Abstract: Although Reinforcement Learning (RL) algorithms acquire sequential behavioral patterns through interactions with the environment, their effectiveness in noisy and high-dimensional scenarios typically relies on specific structural priors. In this paper, we propose a novel and general Structural Information principles-based framework for effective Decision-Making, namely SIDM, approached from an inf…

    Submitted 15 April, 2024; originally announced April 2024.

  8. arXiv:2404.09509  [pdf, other]

    cs.CV

    Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder

    Authors: Chong Peng, Liqiang He, Dan Su

    Abstract: Today, there have been many achievements in learning the association between voice and face. However, most previous work models rely on cosine similarity or L2 distance to evaluate the likeness of voices and faces following contrastive learning, subsequently applied to retrieval and matching tasks. This method only considers the embeddings as high-dimensional vectors, utilizing a minimal scope of…

    Submitted 15 April, 2024; originally announced April 2024.

  9. arXiv:2403.07219  [pdf, other]

    cs.CV

    Monocular Microscope to CT Registration using Pose Estimation of the Incus for Augmented Reality Cochlear Implant Surgery

    Authors: Yike Zhang, Eduardo Davalos, Dingjie Su, Ange Lou, Jack H. Noble

    Abstract: For those experiencing severe-to-profound sensorineural hearing loss, the cochlear implant (CI) is the preferred treatment. Augmented reality (AR) aided surgery can potentially improve CI procedures and hearing outcomes. Typically, AR solutions for image-guided surgery rely on optical tracking systems to register pre-operative planning information to the display so that hidden anatomy or other imp…

    Submitted 11 March, 2024; originally announced March 2024.

  10. arXiv:2402.16819  [pdf, other]

    cs.CL cs.AI cs.LG

    Nemotron-4 15B Technical Report

    Authors: Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, et al. (2 additional authors not shown)

    Abstract: We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remai…

    Submitted 27 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  11. arXiv:2402.14083  [pdf, other]

    cs.AI

    Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping

    Authors: Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul Mcvay, Michael Rabbat, Yuandong Tian

    Abstract: While Transformers have enabled tremendous progress in various application settings, such architectures still trail behind traditional symbolic planners for solving complex decision making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks. This is accomplished by training an encoder-decoder Transformer model to predict the search dynamics of the $A^*$ se…

    Submitted 26 April, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

  12. arXiv:2402.10695  [pdf, other]

    cs.LG cs.AI cs.CR

    Unlink to Unlearn: Simplifying Edge Unlearning in GNNs

    Authors: Jiajun Tan, Fei Sun, Ruichen Qiu, Du Su, Huawei Shen

    Abstract: As concerns over data privacy intensify, unlearning in Graph Neural Networks (GNNs) has emerged as a prominent research frontier in academia. This concept is pivotal in enforcing the "right to be forgotten", which entails the selective removal of specific data from trained GNNs upon user request. Our research focuses on edge unlearning, a process of particular relevance to real-world applic…

    Submitted 11 March, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Accepted by WWW 2024 as a Short Research Paper

  13. arXiv:2402.06128  [pdf, other]

    cs.LG cs.AI cs.SI

    Rethinking Node-wise Propagation for Large-scale Graph Learning

    Authors: Xunkai Li, Jingyuan Ma, Zhengyu Wu, Daohan Su, Wentao Zhang, Rong-Hua Li, Guoren Wang

    Abstract: Scalable graph neural networks (GNNs) have emerged as a promising technique, which exhibits superior predictive performance and high running efficiency across numerous large-scale graph-based web applications. However, (i) Most scalable GNNs tend to treat all nodes in graphs with the same propagation rules, neglecting their topological uniqueness; (ii) Existing node-wise propagation optimization s…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted by WWW 2024

  14. arXiv:2401.13601  [pdf, other]

    cs.CL

    MM-LLMs: Recent Advances in MultiModal Large Language Models

    Authors: Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, Dong Yu

    Abstract: In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive surve…

    Submitted 28 May, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted by ACL2024 (findings)

  15. arXiv:2401.11772  [pdf, other]

    cs.LG cs.AI cs.SI

    LightDiC: A Simple yet Effective Approach for Large-scale Digraph Representation Learning

    Authors: Xunkai Li, Meihao Liao, Zhengyu Wu, Daohan Su, Wentao Zhang, Rong-Hua Li, Guoren Wang

    Abstract: Most existing graph neural networks (GNNs) are limited to undirected graphs, whose restricted scope of the captured relational information hinders their expressive capabilities and deployments in real-world scenarios. Compared with undirected graphs, directed graphs (digraphs) fit the demand for modeling more complex topological systems by capturing more intricate relationships between nodes, such…

    Submitted 17 February, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted by VLDB 2024

  16. arXiv:2312.04111  [pdf, other]

    cs.LG cs.AI cs.SI

    Breaking the Entanglement of Homophily and Heterophily in Semi-supervised Node Classification

    Authors: Henan Sun, Xunkai Li, Zhengyu Wu, Daohan Su, Rong-Hua Li, Guoren Wang

    Abstract: Recently, graph neural networks (GNNs) have shown prominent performance in semi-supervised node classification by leveraging knowledge from the graph database. However, most existing GNNs follow the homophily assumption, where connected nodes are more likely to exhibit similar feature distributions and the same labels, and such an assumption has proven to be vulnerable in a growing number of pract…

    Submitted 10 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: Accepted by ICDE 2024

  17. arXiv:2310.10992  [pdf, other]

    cs.SD eess.AS

    A High Fidelity and Low Complexity Neural Audio Coding

    Authors: Wenzhe Liu, Wei Xiao, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu

    Abstract: Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor high-frequency expression and high computational cost and storage consumption, we proposed an integrated framework that utilizes a neural network to model wide…

    Submitted 17 October, 2023; originally announced October 2023.

  18. arXiv:2309.12792  [pdf, other]

    eess.AS cs.SD

    DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis

    Authors: Yu Gu, Yianrao Bian, Guangzhi Lei, Chao Weng, Dan Su

    Abstract: This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model is adopted. Meanwhile the proposed Du…

    Submitted 22 September, 2023; originally announced September 2023.

  19. arXiv:2309.08058  [pdf, other]

    cs.CR cs.SE

    Unleashing the Adversarial Facet of Software Debloating

    Authors: Do-Men Su, Mohannad Alhanahnah

    Abstract: Software debloating techniques are applied to craft a specialized version of the program based on the user's requirements and remove irrelevant code accordingly. The debloated programs presumably maintain better performance and reduce the attack surface in contrast to the original programs. This work unleashes the effectiveness of applying software debloating techniques on the robustness of machin…

    Submitted 14 September, 2023; originally announced September 2023.

  20. Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

    Authors: Jiaxu Zhu, Weinan Tong, Yaoxun Xu, Changhe Song, Zhiyong Wu, Zhao You, Dan Su, Dong Yu, Helen Meng

    Abstract: Mapping two modalities, speech and text, into a shared representation space, is a research topic of using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, the length of speech representation and text representation is inconsistent. Although the previous method up-samples the text representation to align with acoustic modality, it may not…

    Submitted 7 October, 2023; v1 submitted 4 September, 2023; originally announced September 2023.

    Comments: Proceedings of Interspeech. arXiv admin note: text overlap with arXiv:2309.01437

  21. arXiv:2306.13518  [pdf, other]

    cs.CV cs.RO

    Segmentation and Tracking of Vegetable Plants by Exploiting Vegetable Shape Feature for Precision Spray of Agricultural Robots

    Authors: Nan Hu, Daobilige Su, Shuo Wang, Xuechang Wang, Huiyu Zhong, Zimeng Wang, Yongliang Qiao, Yu Tan

    Abstract: With the increasing deployment of agricultural robots, the traditional manual spray of liquid fertilizer and pesticide is gradually being replaced by agricultural robots. For robotic precision spray application in vegetable farms, accurate plant phenotyping through instance segmentation and robust plant tracking are of great importance and a prerequisite for the following spray action. Regarding t…

    Submitted 26 June, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

  22. arXiv:2305.12178  [pdf, other]

    cs.LG cs.CY

    Model Debiasing via Gradient-based Explanation on Representation

    Authors: Jindi Zhang, Luning Wang, Dan Su, Yongxiang Huang, Caleb Chen Cao, Lei Chen

    Abstract: Machine learning systems produce biased results towards certain demographic groups, known as the fairness problem. Recent approaches to tackle this problem learn a latent code (i.e., representation) through disentangled representation learning and then discard the latent code dimensions correlated with sensitive attributes (e.g., gender). Nevertheless, these approaches may suffer from incomplete d…

    Submitted 3 September, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

  23. arXiv:2304.11220  [pdf, other]

    cs.CL

    Learn What NOT to Learn: Towards Generative Safety in Chatbots

    Authors: Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeed Ghadimi, Hossein Sameti, Pascale Fung

    Abstract: Conversational models that are generative and open-domain are particularly susceptible to generating unsafe content since they are trained on web-based social data. Prior approaches to mitigating this issue have drawbacks, such as disrupting the flow of conversation, limited generalization to unseen toxic input contexts, and sacrificing the quality of the dialogue for the sake of safety. In this p…

    Submitted 25 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

    Comments: 9 pages, 3 tables, 3 figures

  24. arXiv:2302.04023  [pdf, other]

    cs.CL cs.AI

    A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

    Authors: Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung

    Abstract: This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.…

    Submitted 28 November, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

    Comments: 45 pages, AACL 2023

  25. arXiv:2301.00656  [pdf, other]

    eess.AS cs.CL cs.LG

    TriNet: stabilizing self-supervised learning from complete or slow collapse on ASR

    Authors: Lixin Cao, Jun Wang, Ben Yang, Dan Su, Dong Yu

    Abstract: Self-supervised learning (SSL) models confront challenges of abrupt informational collapse or slow dimensional collapse. We propose TriNet, which introduces a novel triple-branch architecture for preventing collapse and stabilizing the pre-training. TriNet learns the SSL latent embedding space and incorporates it to a higher level space for predicting pseudo target vectors generated by a frozen te…

    Submitted 14 March, 2023; v1 submitted 12 December, 2022; originally announced January 2023.

    Comments: Accepted by ICASSP 2023

  26. arXiv:2212.09648  [pdf, other]

    cs.CL cs.AI

    NusaCrowd: Open Source Initiative for Indonesian NLP Resources

    Authors: Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Fajri Koto, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Ivan Halim Parmonangan, Ika Alfina, Muhammad Satrio Wicaksono, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Akbar Septiandri, James Jaya, Kaustubh D. Dhole, Arie Ardiyanti Suryani, Rifki Afina Putri, et al. (22 additional authors not shown)

    Abstract: We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple exp…

    Submitted 21 July, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

  27. arXiv:2212.01546  [pdf, other]

    cs.SD eess.AS

    UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis

    Authors: Yi Lei, Shan Yang, Xinsheng Wang, Qicong Xie, Jixun Yao, Lei Xie, Dan Su

    Abstract: Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating high-quality speaking and singing voice according to textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial to the applications requiring both of them. Existing methods usually suffer from some limitations, which rely on either both singing and speaking data from the same person or…

    Submitted 6 December, 2022; v1 submitted 3 December, 2022; originally announced December 2022.

  28. arXiv:2211.08386  [pdf, other]

    cs.CL cs.AI

    Generative Long-form Question Answering: Relevance, Faithfulness and Succinctness

    Authors: Dan Su

    Abstract: In this thesis, we investigated the relevance, faithfulness, and succinctness aspects of Long Form Question Answering (LFQA). LFQA aims to generate an in-depth, paragraph-length answer for a given question, to help bridge the gap between real scenarios and the existing open-domain QA models which can only extract short-span answers. LFQA is quite challenging and under-explored. Few works have been…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: PhD Thesis

  29. arXiv:2210.07688  [pdf, other]

    cs.CL cs.CV

    Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training

    Authors: Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, Pascale Fung

    Abstract: Large-scale vision-language pre-trained (VLP) models are prone to hallucinate non-existent visual objects when generating text based on visual information. In this paper, we systematically study the object hallucination problem from three aspects. First, we examine recent state-of-the-art VLP models, showing that they still hallucinate frequently, and models achieving better scores on standard met…

    Submitted 9 February, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

    Comments: Accepted at EACL 2023

  30. arXiv:2210.06349  [pdf, other]

    cs.CL cs.AI

    Context Generation Improves Open Domain Question Answering

    Authors: Dan Su, Mostofa Patwary, Shrimai Prabhumoye, Peng Xu, Ryan Prenger, Mohammad Shoeybi, Pascale Fung, Anima Anandkumar, Bryan Catanzaro

    Abstract: Closed-book question answering (QA) requires a model to directly answer an open-domain question without access to any external knowledge. Prior work on closed-book QA either directly finetunes or prompts a pretrained language model (LM) to leverage the stored knowledge. However, they do not fully exploit the parameterized knowledge. To address this issue, we propose a two-stage, closed-book QA fra…

    Submitted 27 April, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: 8 pages; Accepted at EACL2023

  31. arXiv:2210.05600  [pdf, ps, other]

    cs.RO

    Observability Analysis of Graph SLAM-Based Joint Calibration of Multiple Microphone Arrays and Sound Source Localization

    Authors: Yuanzheng He, Jiang Wang, Daobilige Su, Kazuhiro Nakadai, Junfeng Wu, Shoudong Huang, Youfu Li, He Kong

    Abstract: Multiple microphone arrays have many applications in robot audition, including sound source localization, audio scene perception and analysis, etc. However, accurate calibration of multiple microphone arrays remains a challenge because there are many unknown parameters to be identified, including the Euler angles, geometry, asynchronous factors between the microphone arrays. This paper is concerne…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: Accepted for presentation at the 2023 IEEE/SICE International Symposium on System Integrations, Atlanta, USA

  32. arXiv:2210.05092  [pdf, other]

    cs.SD eess.AS

    The DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022

    Authors: Xiaoyi Qin, Na Li, Yuke Lin, Yiwei Ding, Chao Weng, Dan Su, Ming Li

    Abstract: This paper is the system description of the DKU-Tencent System for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC22). In this challenge, we focus on track1 and track3. For track1, multiple backbone networks are adopted to extract frame-level features. Since track1 focuses on the cross-age scenarios, we adopt the cross-age trials and perform QMF to calibrate score. The magnitude-based qualit…

    Submitted 10 October, 2022; originally announced October 2022.

  33. arXiv:2209.01750  [pdf, ps, other]

    cs.LG cs.NI

    Boost Decentralized Federated Learning in Vehicular Networks by Diversifying Data Sources

    Authors: Dongyuan Su, Yipeng Zhou, Laizhong Cui

    Abstract: Recently, federated learning (FL) has received intensive research because of its ability in preserving data privacy for scattered clients to collaboratively train machine learning models. Commonly, a parameter server (PS) is deployed for aggregating model parameters contributed by different clients. Decentralized federated learning (DFL) is upgraded from FL which allows clients to aggregate model…

    Submitted 5 September, 2022; originally announced September 2022.

    Comments: To appear in the 30th IEEE International Conference on Network Protocols (IEEE ICNP 2022)

  34. arXiv:2207.08894  [pdf, other]

    cs.LG cs.AI cs.GT

    A Deep Reinforcement Learning Approach for Finding Non-Exploitable Strategies in Two-Player Atari Games

    Authors: Zihan Ding, Dijia Su, Qinghua Liu, Chi Jin

    Abstract: This paper proposes new, end-to-end deep reinforcement learning algorithms for learning two-player zero-sum Markov games. Different from prior efforts on training agents to beat a fixed set of opponents, our objective is to find the Nash equilibrium policies that are free from exploitation by even the adversarial opponents. We propose (a) Nash-DQN algorithm, which integrates the deep learning tech…

    Submitted 6 March, 2023; v1 submitted 18 July, 2022; originally announced July 2022.

  35. arXiv:2207.05929  [pdf, other]

    eess.AS cs.SD

    Cross-Age Speaker Verification: Learning Age-Invariant Speaker Embeddings

    Authors: Xiaoyi Qin, Na Li, Chao Weng, Dan Su, Ming Li

    Abstract: Automatic speaker verification has achieved remarkable progress in recent years. However, there is little research on cross-age speaker verification (CASV) due to insufficient relevant data. In this paper, we mine cross-age test sets based on the VoxCeleb dataset and propose our age-invariant speaker representation (AISR) learning method. Since the VoxCeleb is collected from the YouTube platform, t…

    Submitted 12 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech 2022

  36. arXiv:2207.01832  [pdf, other]

    cs.SD eess.AS

    Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

    Authors: Yi Lei, Shan Yang, Jian Cong, Lei Xie, Dan Su

    Abstract: The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting new voices in zero-shot scenario exist in both stages -- acoustic modeling and vocoder, previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming…

    Submitted 5 July, 2022; originally announced July 2022.

  37. arXiv:2207.00756  [pdf, other]

    cs.SD eess.AS

    Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers

    Authors: Liumeng Xue, Shan Yang, Na Hu, Dan Su, Lei Xie

    Abstract: Building a voice conversion system for noisy target speakers, such as users providing noisy samples or Internet found data, is a challenging task since the use of contaminated speech in model training will apparently degrade the conversion performance. In this paper, we leverage the advances of our recently proposed Glow-WaveGAN and propose a noise-independent speech representation learning approa…

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: Accepted by INTERSPEECH 2022

  38. arXiv:2206.07569  [pdf, other]

    eess.AS cs.SD

    End-to-End Voice Conversion with Information Perturbation

    Authors: Qicong Xie, Shan Yang, Yi Lei, Lei Xie, Dan Su

    Abstract: The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are insufficient to achieve comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is…

    Submitted 15 June, 2022; originally announced June 2022.

  39. arXiv:2206.03970  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Narrowing the Coordinate-frame Gap in Behavior Prediction Models: Distillation for Efficient and Accurate Scene-centric Motion Forecasting

    Authors: DiJia Su, Bertrand Douillard, Rami Al-Rfou, Cheolho Park, Benjamin Sapp

    Abstract: Behavior prediction models have proliferated in recent years, especially in the popular real-world robotics application of autonomous driving, where representing the distribution over possible futures of moving agents is essential for safe and comfortable motion planning. In these models, the choice of coordinate frames to represent inputs and outputs has crucial trade offs which broadly fall into…

    Submitted 10 June, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: Accepted at ICRA 2022

  40. arXiv:2206.00208  [pdf, other]

    cs.SD eess.AS

    AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation

    Authors: Kun Song, Heyang Xue, Xinsheng Wang, Jian Cong, Yongmao Zhang, Lei Xie, Bing Yang, Xiong Zhang, Dan Su

    Abstract: Speaker adaptation in text-to-speech synthesis (TTS) is to finetune a pre-trained TTS model to adapt to new target speakers with limited data. While much effort has been conducted towards this task, seldom work has been performed for low computational resource scenarios due to the challenges raised by the requirement of the lightweight model and less computational complexity. In this paper, a tiny…

    Submitted 2 November, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

    Comments: Accepted by ISCSLP 2022

  41. arXiv:2205.05989  [pdf, other]

    cs.CL cs.AI cs.LG

    Towards Answering Open-ended Ethical Quandary Questions

    Authors: Yejin Bang, Nayeon Lee, Tiezheng Yu, Leila Khalatbari, Yan Xu, Samuel Cahyawijaya, Dan Su, Bryan Wilie, Romain Barraud, Elham J. Barezi, Andrea Madotto, Hayden Kee, Pascale Fung

    Abstract: Considerable advancements have been made in various NLP tasks based on the impressive power of large language models (LLMs) and many NLP applications are deployed in our daily lives. In this work, we challenge the capability of LLMs with the new task of Ethical Quandary Generative Question Answering. Ethical quandary questions are more challenging to address because multiple conflicting answers ma…

    Submitted 1 February, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: 16 pages

  42. arXiv:2204.12645  [pdf]

    cs.HC

    Circular cartograms via the elastic beam algorithm originated from cartographic generalization

    Authors: Wei Zhiwei, Ding Su, Xu Wenjia, Cheng Lu, Zhang Song, Wang Yang

    Abstract: The circular cartogram, also known as the Dorling map, is a widely used tool for visualizing statistical data. It represents regions as circles with their areas in proportion to the statistical values and requires circle displacement to avoid overlap and maintain spatial relationships. In this paper, we propose a new approach for circular cartogram production that utilizes the elastic beam displac…

    Submitted 24 April, 2023; v1 submitted 26 April, 2022; originally announced April 2022.

    Comments: 25 pages, 9 figures

  43. arXiv:2204.09934  [pdf, other]

    eess.AS cs.LG cs.SD

    FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

    Authors: Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao

    Abstract: Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the inherited iterative sampling process costs hindered their applications to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions of div…

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: Accepted by IJCAI 2022

  44. arXiv:2204.03178  [pdf, other]

    cs.SD cs.CL eess.AS

    3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

    Authors: Zhao You, Shulin Feng, Dan Su, Dong Yu

    Abstract: Recently, Conformer based CTC/AED model has become a mainstream architecture for ASR. In this paper, based on our prior work, we identify and integrate several approaches to achieve further improvements for ASR tasks, which we denote as multi-loss, multi-path and multi-level, summarized as "3M" model. Specifically, multi-loss refers to the joint CTC/AED loss and multi-path denotes the Mixture-of-E…

    Submitted 14 April, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: 5 pages, 1 figure. Submitted to INTERSPEECH 2022

  45. arXiv:2204.00990  [pdf, other]

    cs.SD eess.AS

    Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis

    Authors: Yixuan Zhou, Changhe Song, Xiang Li, Luwen Zhang, Zhiyong Wu, Yanyao Bian, Dan Su, Helen Meng

    Abstract: Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time and parameters. Previous researches usually use a speaker encoder to extract a global fixed speaker embedding from reference speech, and several attempts have tried variable-length speaker embedding. However, they neglect to transfer the personal pronunciation characteristics related to phoneme content…

    Submitted 11 November, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

    Comments: Accepted by Interspeech 2022

  46. arXiv:2203.13508  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

    Authors: Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

    Abstract: Diffusion probabilistic models (DPMs) and their extensions have emerged as competitive generative models yet confront challenges of efficient sampling. We propose a new bilateral denoising diffusion model (BDDM) that parameterizes both the forward and reverse processes with a schedule network and a score network, which can train with a novel bilateral modeling objective. We show that the new surro…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: Accepted in ICLR 2022. arXiv admin note: text overlap with arXiv:2108.11514

    Journal ref: International Conference on Learning Representations 2022

  47. arXiv:2203.00343  [pdf, other]

    cs.CL cs.AI

    Read before Generate! Faithful Long Form Question Answering with Machine Reading

    Authors: Dan Su, Xiaoguang Li, Jindi Zhang, Lifeng Shang, Xin Jiang, Qun Liu, Pascale Fung

    Abstract: Long-form question answering (LFQA) aims to generate a paragraph-length answer for a given question. While current work on LFQA using large pre-trained model for generation are effective at producing fluent and somewhat relevant content, one primary challenge lies in how to generate a faithful answer that has less hallucinated content. We propose a new end-to-end framework that jointly models answ…

    Submitted 1 March, 2022; originally announced March 2022.

    Comments: long paper, accepted to ACL 2022 findings

  48. arXiv:2202.09081  [pdf, other]

    eess.AS cs.AI cs.CV cs.MM cs.SD eess.IV

    VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

    Authors: Disong Wang, Shan Yang, Dan Su, Xunying Liu, Dong Yu, Helen Meng

    Abstract: Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quanti…

    Submitted 18 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022. Demo page is available at https://wendison.github.io/VCVTS-demo/

  49. arXiv:2202.06538  [pdf, other]

    cs.CL cs.AI

    QA4QG: Using Question Answering to Constrain Multi-Hop Question Generation

    Authors: Dan Su, Peng Xu, Pascale Fung

    Abstract: Multi-hop question generation (MQG) aims to generate complex questions which require reasoning over multiple pieces of information of the input passage. Most existing work on MQG has focused on exploring graph-based networks to equip the traditional Sequence-to-sequence framework with reasoning ability. However, these models do not take full advantage of the constraint between questions and answer…

    Submitted 14 February, 2022; originally announced February 2022.

    Comments: 4 pages, accepted by ICASSP 2022

  50. Survey of Hallucination in Natural Language Generation

    Authors: Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, Pascale Fung

    Abstract: Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However,…

    Submitted 14 July, 2024; v1 submitted 7 February, 2022; originally announced February 2022.

    ACM Class: A.1

    Journal ref: ACM Computing Surveys (2022)