Showing 1–50 of 369 results for author: Qian, Y

  1. arXiv:2407.02477  [pdf, other]

    cs.CV cs.CL

    Understanding Alignment in Multimodal LLMs: A Comprehensive Study

    Authors: Elmira Amirloo, Jean-Philippe Fauconnier, Christoph Roesmann, Christian Kerl, Rinu Boney, Yusu Qian, Zirui Wang, Afshin Dehghan, Yinfei Yang, Zhe Gan, Peter Grasch

    Abstract: Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by pro…

    Submitted 2 July, 2024; originally announced July 2024.

  2. arXiv:2407.01509  [pdf, other]

    cs.CV cs.CL

    MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

    Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan

    Abstract: We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results fro…

    Submitted 1 July, 2024; originally announced July 2024.

  3. arXiv:2407.00105  [pdf, other]

    cs.LG cs.AI

    Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction

    Authors: Yuqing Qian, Ziyu Zheng, Prayag Tiwari, Yijie Ding, Quan Zou

    Abstract: Drug-side effect prediction has become an essential area of research in the field of pharmacology. As the use of medications continues to rise, so does the importance of understanding and mitigating the potential risks associated with them. At present, researchers have turned to data-driven methods to predict drug-side effects. Drug-side effect prediction is a link prediction problem, and the rela…

    Submitted 27 June, 2024; originally announced July 2024.

    Comments: Transactions on Machine Learning Research (TMLR 2024)

  4. arXiv:2406.13471  [pdf, other]

    eess.AS cs.SD

    Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement

    Authors: Chenda Li, Samuele Cornell, Shinji Watanabe, Yanmin Qian

    Abstract: Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement research (SE) as previous works showed a remarkable generalization capability. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminat…

    Submitted 19 June, 2024; originally announced June 2024.

  5. arXiv:2406.11740  [pdf, other]

    cs.RO cs.AI cs.LG

    Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies

    Authors: Haojie Huang, Karl Schmeckpeper, Dian Wang, Ondrej Biza, Yaoyao Qian, Haotian Liu, Mingxi Jia, Robert Platt, Robin Walters

    Abstract: Humans can imagine goal states during planning and perform actions to match those goals. In this work, we propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks. Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states which are then translated to actions using rigid action estimation.…

    Submitted 17 June, 2024; originally announced June 2024.

  6. arXiv:2406.11364  [pdf, other]

    cs.SD eess.AS

    AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection

    Authors: Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan

    Abstract: Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, res…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  7. arXiv:2406.08812  [pdf, other]

    cs.SD eess.AS

    Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

    Authors: Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian

    Abstract: This paper proposes a speech synthesis system that allows users to specify and control the acoustic characteristics of a speaker by means of prompts describing the speaker's traits of synthesized speech. Unlike previous approaches, our method utilizes listener impressions to construct prompts, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We ado…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted for presentation at Interspeech 2024 (with more analysis in the final Appendix part)

  8. arXiv:2406.07855  [pdf, other]

    cs.CL cs.SD eess.AS

    VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

    Authors: Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei

    Abstract: With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings h…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 15 pages, 5 figures

  9. arXiv:2406.07198  [pdf, other]

    eess.AS cs.MM

    Target Speech Diarization with Multimodal Prompts

    Authors: Yidi Jiang, Ruijie Tao, Zhengyang Chen, Yanmin Qian, Haizhou Li

    Abstract: Traditional speaker diarization seeks to detect "who spoke when" according to speaker characteristics. Extending to target speech diarization, we detect "when target event occurs" according to the semantic characteristics of speech. We propose a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexib…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 13 pages, 7 figures

  10. arXiv:2406.06350  [pdf, other]

    math.NA cs.LG physics.comp-ph

    Error Analysis and Numerical Algorithm for PDE Approximation with Hidden-Layer Concatenated Physics Informed Neural Networks

    Authors: Yianxia Qian, Yongchao Zhang, Suchuan Dong

    Abstract: We present the hidden-layer concatenated physics informed neural network (HLConcPINN) method, which combines hidden-layer concatenated feed-forward neural networks, a modified block time marching strategy, and a physics informed approach for approximating partial differential equations (PDEs). We analyze the convergence properties and establish the error bounds of this method for two types of PDEs…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 40 pages, 10 tables, 18 figures

  11. arXiv:2406.05370  [pdf, other]

    cs.CL cs.SD eess.AS

    VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

    Authors: Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei

    Abstract: This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in…

    Submitted 17 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

    Comments: Demo posted

  12. arXiv:2406.05359  [pdf, other]

    eess.AS cs.SD

    Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization

    Authors: Bei Liu, Haoyu Wang, Yanmin Qian

    Abstract: Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids custom…

    Submitted 18 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing (Under Review)

  13. arXiv:2406.04660  [pdf, other]

    eess.AS cs.SD

    URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

    Authors: Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

    Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generaliza…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

  14. arXiv:2406.04269  [pdf, other]

    eess.AS cs.SD

    Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

    Authors: Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian

    Abstract: Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement.…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: 5 pages, 3 figures, 4 tables, Accepted by Interspeech 2024

  15. arXiv:2405.17809  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

    Authors: Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng

    Abstract: There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework by concatenating speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complex…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Work in progress

  16. arXiv:2405.17233  [pdf, other]

    cs.LG

    CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

    Authors: Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

    Abstract: Parameter quantization for Large Language Models (LLMs) has attracted increasing attention recently in reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods suffer from poor performance in low-bit (such as 2 to 3 bits) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantizatio…

    Submitted 2 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  17. arXiv:2405.15580  [pdf, other]

    cs.CV

    Open-Vocabulary SAM3D: Understand Any 3D Scene

    Authors: Hanchen Tai, Qingdong He, Jiangning Zhang, Yijie Qian, Zhenyu Zhang, Xiaobin Hu, Yabiao Wang, Yong Liu

    Abstract: Open-vocabulary 3D scene understanding presents a significant challenge in the field. Recent advancements have sought to transfer knowledge embedded in vision language models from the 2D domain to 3D domain. However, these approaches often require learning prior knowledge from specific 3D scene datasets, which limits their applicability in open-world scenarios. The Segment Anything Model (SAM) has…

    Submitted 21 June, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: Project page: https://hithqd.github.io/projects/OV-SAM3D

  18. arXiv:2405.13080  [pdf, other]

    cs.CR cs.LG

    EmInspector: Combating Backdoor Attacks in Federated Self-Supervised Learning Through Embedding Inspection

    Authors: Yuwen Qian, Shuchi Wu, Kang Wei, Ming Ding, Di Xiao, Tao Xiang, Chuan Ma, Song Guo

    Abstract: Federated self-supervised learning (FSSL) has recently emerged as a promising paradigm that enables the exploitation of clients' vast amounts of unlabeled data while preserving data privacy. While FSSL offers advantages, its susceptibility to backdoor attacks, a concern identified in traditional federated supervised learning (FSL), has not been investigated. To fill the research gap, we undertake…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: 18 pages, 12 figures

  19. arXiv:2405.12944  [pdf, other]

    cs.CV

    AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection

    Authors: Zizhao Chen, Yeqiang Qian, Xiaoxiao Yang, Chunxiang Wang, Ming Yang

    Abstract: Multispectral pedestrian detection has been shown to be effective in improving performance within complex illumination scenarios. However, prevalent double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data, leading to nearly double the inference time compared to single-stream networks utilizing only one feature extraction branch. This i…

    Submitted 21 May, 2024; originally announced May 2024.

  20. arXiv:2405.12914  [pdf, other]

    cs.CV

    An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

    Authors: Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li

    Abstract: One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can merely encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Langu…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: Technical report. Project page: https://github.com/llm-conditioned-diffusion/llm-conditioned-diffusion

  21. arXiv:2405.06510  [pdf, other]

    cs.AI

    UniDM: A Unified Framework for Data Manipulation with Large Language Models

    Authors: Yichen Qian, Yongyi He, Rong Zhu, Jintao Huang, Zhijian Ma, Haibin Wang, Yaohua Wang, Xiuyu Sun, Defu Lian, Bolin Ding, Jingren Zhou

    Abstract: Designing effective data manipulation methods is a long standing problem in data lakes. Traditional methods, which rely on rules or machine learning models, require extensive human efforts on training data collection and tuning models. Recent methods apply Large Language Models (LLMs) to resolve multiple data manipulation tasks. They exhibit bright benefits in terms of performance but still requir…

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: MLSys24

  22. arXiv:2404.19040  [pdf, other]

    cs.CV

    GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting

    Authors: Bo Chen, Shoukang Hu, Qi Chen, Chenpeng Du, Ran Yi, Yanmin Qian, Xie Chen

    Abstract: We present GSTalker, a 3D audio-driven talking face generation model with Gaussian Splatting for both fast training (40 minutes) and real-time rendering (125 FPS) with a 3–5 minute video for training material, in comparison with previous 2D and 3D NeRF-based modeling frameworks which require hours of training and seconds of rendering per frame. Specifically, GSTalker learns an audio-driven Ga…

    Submitted 29 April, 2024; originally announced April 2024.

  23. arXiv:2404.13279  [pdf, other]

    cs.CR eess.IV eess.SP

    Backdoor Attacks and Defenses on Semantic-Symbol Reconstruction in Semantic Communications

    Authors: Yuan Zhou, Rose Qingyang Hu, Yi Qian

    Abstract: Semantic communication is of crucial importance for the next-generation wireless communication networks. The existing works have developed semantic communication frameworks based on deep learning. However, systems powered by deep learning are vulnerable to threats such as backdoor attacks and adversarial attacks. This paper delves into backdoor attacks targeting deep learning-enabled semantic comm…

    Submitted 20 April, 2024; originally announced April 2024.

    Comments: This paper has been accepted by IEEE ICC 2024

  24. arXiv:2404.11035  [pdf, other]

    cs.IT cs.DC cs.NI

    Approximate Wireless Communication for Lossy Gradient Updates in IoT Federated Learning

    Authors: Xiang Ma, Haijian Sun, Rose Qingyang Hu, Yi Qian

    Abstract: Federated learning (FL) has emerged as a distributed machine learning (ML) technique that can protect local data privacy for participating clients and improve system efficiency. Instead of sharing raw data, FL exchanges intermediate learning parameters, such as gradients, among clients. This article presents an efficient wireless communication approach tailored for FL parameter transmission, espec…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: submitted to IEEE journals for publication

  25. arXiv:2404.06690  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

    Authors: Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

    Abstract: Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-rou…

    Submitted 29 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  26. arXiv:2404.01462  [pdf, other]

    cs.LG cs.CL cs.IR

    OpenChemIE: An Information Extraction Toolkit For Chemistry Literature

    Authors: Vincent Fan, Yujie Qian, Alex Wang, Amber Wang, Connor W. Coley, Regina Barzilay

    Abstract: Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extractio…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: To be submitted to the Journal of Chemical Information and Modeling

  27. arXiv:2404.00878  [pdf, other]

    cs.CV

    TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On

    Authors: Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang

    Abstract: Virtual try-on focuses on adjusting the given clothes to fit a specific person seamlessly while avoiding any distortion of the patterns and textures of the garment. However, the clothing identity uncontrollability and training inefficiency of existing diffusion-based methods, which struggle to maintain the identity even with full parameter training, are significant limitations that hinder the wide…

    Submitted 31 March, 2024; originally announced April 2024.

  28. arXiv:2404.00875  [pdf, other]

    cs.CV

    DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly

    Authors: Fenggen Yu, Yiming Qian, Xu Zhang, Francisca Gil-Ureta, Brian Jackson, Eric Bennett, Hao Zhang

    Abstract: We present a differentiable rendering framework to learn structured 3D abstractions in the form of primitive assemblies from sparse RGB images capturing a 3D object. By leveraging differentiable volume rendering, our method does not require 3D supervision. Architecturally, our network follows the general pipeline of an image-conditioned neural radiance field (NeRF) exemplified by pixelNeRF for col…

    Submitted 2 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

    Comments: 14 pages

  29. arXiv:2404.00589   

    cs.LG cs.CL

    Harnessing the Power of Large Language Model for Uncertainty Aware Graph Processing

    Authors: Zhenyu Qian, Yiming Qian, Yuting Song, Fei Gao, Hai Jin, Chen Yu, Xia Xie

    Abstract: Handling graph data is one of the most difficult tasks. Traditional techniques, such as those based on geometry and matrix factorization, rely on assumptions about the data relations that become inadequate when handling large and complex graph data. On the other hand, deep learning approaches demonstrate promising results in handling large graph data, but they often fall short of providing interpr…

    Submitted 12 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

    Comments: Because my organization does not allow members to privately upload papers to arXiv, I am requesting a withdrawal of my submission

  30. arXiv:2403.17870  [pdf, other]

    cs.CV cs.MM

    Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

    Authors: Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei

    Abstract: Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prio…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  31. arXiv:2403.16002  [pdf, other]

    cs.CV

    SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking

    Authors: Xiaojun Hou, Jiazheng Xing, Yijie Qian, Yaowei Guo, Shuo Xin, Junhao Chen, Kai Tang, Mengmeng Wang, Zhengkai Jiang, Liang Liu, Yong Liu

    Abstract: Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the m…

    Submitted 27 March, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR2024

  32. Advanced Long-Content Speech Recognition With Factorized Neural Transducer

    Authors: Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

    Abstract: In this paper, we propose two novel approaches, which integrate long-content information into the factorized neural transducer (FNT) based architecture in both non-streaming (referred to as LongFNT) and streaming (referred to as SLongFNT) scenarios. We first investigate whether long-content transcriptions can improve the vanilla conformer transducer (C-T) models. Our experiments indicate that th…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted by TASLP 2024

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1803-1815, 2024

  33. arXiv:2403.00673  [pdf, other]

    cs.LG

    Snapshot Reinforcement Learning: Leveraging Prior Trajectories for Efficiency

    Authors: Yanxiao Zhao, Yangge Qian, Tianyi Wang, Jingyang Shan, Xiaolin Qin

    Abstract: Deep reinforcement learning (DRL) algorithms require substantial samples and computational resources to achieve higher performance, which restricts their practical application and poses challenges for further development. Given the constraint of limited resources, it is essential to leverage existing computational work (e.g., learned policies, samples) to enhance sample efficiency and reduce the c…

    Submitted 12 March, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

    Comments: Under review

  34. arXiv:2402.15173  [pdf, other]

    cs.LG

    Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer

    Authors: Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, Ivor W. Tsang

    Abstract: Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to the backpropagation process. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we…

    Submitted 23 February, 2024; originally announced February 2024.

  35. arXiv:2402.13220  [pdf, other]

    cs.CV cs.CL

    How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

    Authors: Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan

    Abstract: The remarkable advancements in Multimodal Large Language Models (MLLMs) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. To quantitatively assess this vulnerability, we present MAD-Bench, a carefully curated benchmark that contains 850 test samples divided into 6 categ…

    Submitted 20 February, 2024; originally announced February 2024.

  36. arXiv:2402.08875  [pdf, other]

    cs.CV

    Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos

    Authors: Yang Qian, Yinan Sun, Ali Kargarandehkordi, Onur Cezmi Mutlu, Saimourya Surabhi, Pingyi Chen, Zain Jabbar, Dennis Paul Wall, Peter Washington

    Abstract: The increasing variety and quantity of tagged multimedia content on platforms such as TikTok provides an opportunity to advance computer vision modeling. We have curated a distinctive dataset of 283,582 unique video clips categorized under 386 hashtags relating to modern human actions. We release this dataset as a valuable resource for building domain-specific foundation models for human movement…

    Submitted 19 May, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

    Comments: 10 pages

  37. arXiv:2402.05830  [pdf, other]

    cs.LG cs.AI

    Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting

    Authors: Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin

    Abstract: Time series analysis is vital for numerous applications, and transformers have become increasingly prominent in this domain. Leading methods customize the transformer architecture from NLP and CV, utilizing a patching technique to convert continuous signals into segments. Yet, time series data are uniquely challenging due to significant distribution shifts and intrinsic noise levels. To address th…

    Submitted 8 February, 2024; originally announced February 2024.

  38. arXiv:2402.04359  [pdf, other]

    cs.LG

    Adaptive Inference: Theoretical Limits and Unexplored Opportunities

    Authors: Soheil Hor, Ying Qian, Mert Pilanci, Amin Arbabian

    Abstract: This paper introduces the first theoretical framework for quantifying the efficiency and performance gain opportunity size of adaptive inference algorithms. We provide new approximate and exact bounds for the achievable efficiency and performance gains, supported by empirical evidence demonstrating the potential for 10-100x efficiency improvements in both Computer Vision and Natural Language Proce…

    Submitted 6 February, 2024; originally announced February 2024.

  39. arXiv:2401.15989  [pdf, other]

    cs.LG

    Deep Embedding Clustering Driven by Sample Stability

    Authors: Zhanwen Cheng, Feijiang Li, Jieting Wang, Yuhua Qian

    Abstract: Deep clustering methods improve the performance of clustering tasks by jointly optimizing deep representation learning and clustering. While numerous deep clustering algorithms have been proposed, most of them rely on artificially constructed pseudo targets for performing clustering. This construction process requires some prior knowledge, and it is challenging to determine a suitable pseudo targe…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: 8 pages, 5 figures, submitted to a conference

  40. arXiv:2401.14271  [pdf, other]

    eess.AS cs.SD

    Improving Design of Input Condition Invariant Speech Enhancement

    Authors: Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

    Abstract: Building a single universal speech enhancement (SE) system that can handle arbitrary input is a demanded but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio duration, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model wa…

    Submitted 15 February, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024, 5 pages, 2 figures, 3 tables (corrected the results of no processing on CHiME-4 (Simu) in Table 2)

  41. arXiv:2401.13246  [pdf, other]

    cs.CL

    SEER: Facilitating Structured Reasoning and Explanation via Reinforcement Learning

    Authors: Guoxin Chen, Kexin Tang, Chao Yang, Fuying Ye, Yu Qiao, Yiming Qian

    Abstract: Elucidating the reasoning process with structured explanations from question to answer is crucial, as it significantly enhances the interpretability, traceability, and trustworthiness of question-answering (QA) systems. However, structured explanations demand models to perform intricately structured reasoning, which poses great challenges. Most existing methods focus on single-step reasoning throu…

    Submitted 4 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Camera ready version for ACL 2024 Main Conference

  42. arXiv:2401.09673  [pdf, other]

    cs.CV cs.CR cs.LG eess.IV

    Artwork Protection Against Neural Style Transfer Using Locally Adaptive Adversarial Color Attack

    Authors: Zhongliang Guo, Junhao Dong, Yifei Qian, Kaixuan Wang, Weiye Li, Ziheng Guo, Yuheng Wang, Yanli Li, Ognjen Arandjelović, Lei Fang

    Abstract: Neural style transfer (NST) generates new images by combining the style of one image with the content of another. However, unauthorized NST can exploit artwork, raising concerns about artists' rights and motivating the development of proactive protection methods. We propose Locally Adaptive Adversarial Color Attack (LAACA), empowering artists to protect their artwork from unauthorized style transf…

    Submitted 19 April, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

    Comments: 9 pages, 5 figures, 4 tables

  43. arXiv:2401.08915  [pdf, other]

    stat.AP cs.CY

    How do transportation professionals perceive the impacts of AI applications in transportation? A latent class cluster analysis

    Authors: Yiheng Qian, Tejaswi Polimetla, Thomas W. Sanchez, Xiang Yan

    Abstract: Recent years have witnessed an increasing number of artificial intelligence (AI) applications in transportation. As a new and emerging technology, AI's potential to advance transportation goals and the full extent of its impacts on the transportation sector is not yet well understood. As the transportation community explores these topics, it is critical to understand how transportation professiona…

    Submitted 16 January, 2024; originally announced January 2024.

  44. arXiv:2401.08604  [pdf, other]

    cs.CV cs.AI

    SAM4UDASS: When SAM Meets Unsupervised Domain Adaptive Semantic Segmentation in Intelligent Vehicles

    Authors: Weihao Yan, Yeqiang Qian, Xingyuan Chen, Hanyang Zhuang, Chunxiang Wang, Ming Yang

    Abstract: Semantic segmentation plays a critical role in enabling intelligent vehicles to comprehend their surrounding environments. However, deep learning-based methods usually perform poorly in domain shift scenarios due to the lack of labeled data for training. Unsupervised domain adaptation (UDA) techniques have emerged to bridge the gap across different driving scenes and enhance model performance on u…

    Submitted 22 November, 2023; originally announced January 2024.

    Comments: 10 pages, 9 figures, 9 tables

  45. arXiv:2401.03914  [pdf, other]

    cs.CV

    D3PRefiner: A Diffusion-based Denoise Method for 3D Human Pose Refinement

    Authors: Danqi Yan, Qing Gao, Yuepeng Qian, Xinxing Chen, Chenglong Fu, Yuquan Leng

    Abstract: Three-dimensional (3D) human pose estimation using a monocular camera has gained increasing attention due to its ease of implementation and the abundance of data available from daily life. However, owing to the inherent depth ambiguity in images, the accuracy of existing monocular camera-based 3D pose estimation methods remains unsatisfactory, and the estimated 3D poses usually include much noise.…

    Submitted 8 January, 2024; originally announced January 2024.

  46. Hyperspectral Image Denoising via Spatial-Spectral Recurrent Transformer

    Authors: Guanyiman Fu, Fengchao Xiong, Jianfeng Lu, Jun Zhou, Jiantao Zhou, Yuntao Qian

    Abstract: Hyperspectral images (HSIs) often suffer from noise arising from both intra-imaging mechanisms and environmental factors. Leveraging domain knowledge specific to HSIs, such as global spectral correlation (GSC) and non-local spatial self-similarity (NSS), is crucial for effective denoising. Existing methods tend to independently utilize each of these knowledge components with multiple blocks, overl…

    Submitted 8 January, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

  47. arXiv:2401.02678  [pdf, other]

    cs.SD cs.MM eess.AS

    MusicAOG: an Energy-Based Model for Learning and Sampling a Hierarchical Representation of Symbolic Music

    Authors: Yikai Qian, Tianle Wang, Xinyi Tong, Xin Jin, Duo Xu, Bo Zheng, Tiezheng Ge, Feng Yu, Song-Chun Zhu

    Abstract: In addressing the challenge of interpretability and generalizability of artificial music intelligence, this paper introduces a novel symbolic representation that amalgamates both explicit and implicit musical information across diverse traditions and granularities. Utilizing a hierarchical and-or graph representation, the model employs nodes and edges to encapsulate a broad spectrum of musical ele…

    Submitted 5 January, 2024; originally announced January 2024.

  48. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  49. arXiv:2312.06585  [pdf, other]

    cs.LG

    Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

    Authors: Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, et al. (16 additional authors not shown)

    Abstract: Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investig…

    Submitted 17 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Accepted to TMLR. Camera-ready version. First three authors contributed equally

  50. arXiv:2312.04881  [pdf, other]

    cs.CL cs.AI cs.IR

    Predictive Chemistry Augmented with Text Retrieval

    Authors: Yujie Qian, Zhening Li, Zhengkai Tu, Connor W. Coley, Regina Barzilay

    Abstract: This paper focuses on using natural language descriptions to enhance predictive models in the chemistry field. Conventionally, chemoinformatics models are trained with extensive structured data manually extracted from the literature. In this paper, we introduce TextReact, a novel method that directly augments predictive chemistry with texts retrieved from the literature. TextReact retrieves text d…

    Submitted 8 December, 2023; originally announced December 2023.

    Comments: EMNLP 2023