Showing 1–50 of 1,356 results for author: Li, Y

  1. arXiv:2407.11595  [pdf, other]

    eess.SP

    Machine Learning in Communications: A Road to Intelligent Transmission and Processing

    Authors: Shixiong Wang, Geoffrey Ye Li

    Abstract: Prior to the era of artificial intelligence and big data, wireless communications primarily followed a conventional research route involving problem analysis, model building and calibration, algorithm design and tuning, and holistic and empirical verification. However, this methodology often encountered limitations when dealing with large-scale and complex problems and managing dynamic and massive…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Invited Article

  2. arXiv:2407.11541  [pdf, other]

    eess.IV cs.CV

    Uniformly Accelerated Motion Model for Inter Prediction

    Authors: Zhuoyuan Li, Yao Li, Chuanbo Tang, Li Li, Dong Liu, Feng Wu

    Abstract: Inter prediction is a key technology to reduce the temporal redundancy in video coding. In natural videos, there are usually multiple moving objects with variable velocity, resulting in complex motion fields that are difficult to represent compactly. In Versatile Video Coding (VVC), existing inter prediction methods usually assume uniform speed motion between consecutive frames and use the linear…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: 5 pages, 4 figures

  3. arXiv:2407.10926  [pdf, other]

    eess.IV cs.CV

    In-Loop Filtering via Trained Look-Up Tables

    Authors: Zhuoyuan Li, Jiacheng Li, Yao Li, Li Li, Dong Liu, Feng Wu

    Abstract: In-loop filtering (ILF) is a key technology for removing the artifacts in image/video coding standards. Recently, neural network-based in-loop filtering methods achieve remarkable coding gains beyond the capability of advanced video coding standards, which becomes a powerful coding tool candidate for future video coding standards. However, the utilization of deep neural networks brings heavy time…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 11 pages, 6 figures

  4. arXiv:2407.10471  [pdf, other]

    cs.CR cs.AI cs.SD eess.AS

    GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis

    Authors: Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li

    Abstract: Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, p…

    Submitted 15 July, 2024; originally announced July 2024.

  5. arXiv:2407.10307  [pdf, other]

    eess.SY

    Distributed Charging Coordination for Electric Trucks under Limited Facilities and Travel Uncertainties

    Authors: Ting Bai, Yuchao Li, Karl Henrik Johansson, Jonas Mårtensson

    Abstract: In this work, we address the problem of charging coordination between electric trucks and charging stations. The problem arises from the tension between the trucks' nontrivial charging times and the stations' limited charging facilities. Our goal is to reduce the trucks' waiting times at the stations while minimizing individual trucks' operational costs. We propose a distributed coordination frame…

    Submitted 14 July, 2024; originally announced July 2024.

  6. arXiv:2407.10048  [pdf, other]

    cs.SD eess.AS

    Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification

    Authors: Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie

    Abstract: Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are…

    Submitted 13 July, 2024; originally announced July 2024.

  7. arXiv:2407.09732  [pdf, other]

    eess.AS cs.LG cs.SD

    Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis

    Authors: Xilin Jiang, Yinghao Aaron Li, Adrian Nicolas Florea, Cong Han, Nima Mesgarani

    Abstract: It is too early to conclude that Mamba is a better alternative to transformers for speech before comparing Mamba with transformers in terms of both performance and efficiency in multiple speech-related tasks. To reach this conclusion, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compar…

    Submitted 12 July, 2024; originally announced July 2024.

  8. arXiv:2407.09251  [pdf, other]

    cs.LG cs.AI eess.SP

    Deep Adversarial Defense Against Multilevel-Lp Attacks

    Authors: Ren Wang, Yuxuan Li, Alfred Hero

    Abstract: Deep learning models have shown considerable vulnerability to adversarial attacks, particularly as attacker strategies become more sophisticated. While traditional adversarial training (AT) techniques offer some resilience, they often focus on defending against a single type of attack, e.g., the $\ell_\infty$-norm attack, which can fail for other types. This paper introduces a computationally effi…

    Submitted 12 July, 2024; originally announced July 2024.

  9. arXiv:2407.08299  [pdf, other]

    cs.SI eess.SY

    Evolving Network Modeling Driven by the Degree Increase and Decrease Mechanism

    Authors: Yuhan Li, Minyu Feng, Jürgen Kurths

    Abstract: Ever since the Barabási-Albert (BA) scale-free network was proposed, network modeling has been studied intensively in light of the network growth and the preferential attachment (PA). However, numerous real systems are featured with a dynamic evolution including network reduction in addition to network growth. In this paper, we propose a novel mechanism for evolving networks from the perspect…

    Submitted 11 July, 2024; originally announced July 2024.

  10. arXiv:2407.08230  [pdf, other]

    eess.SP

    Handling Distance Constraint in Movable Antenna Aided Systems: A General Optimization Framework

    Authors: Yichen Jin, Qingfeng Lin, Yang Li, Yik-Chung Wu

    Abstract: The movable antenna (MA) is a promising technology to exploit more spatial degrees of freedom for enhancing wireless system performance. However, the MA-aided system introduces the non-convex antenna distance constraints, which poses challenges in the underlying optimization problems. To fill this gap, this paper proposes a general framework for optimizing the MA-aided system under the antenna dis…

    Submitted 11 July, 2024; originally announced July 2024.

  11. arXiv:2407.07567  [pdf, ps, other]

    eess.SP

    Pilot-Based SFO Estimation for Bistatic Integrated Sensing and Communication

    Authors: Lucas Giroto de Oliveira, Yueheng Li, Silvio Mandelli, David Brunner, Marcus Henninger, Xiang Wan, Tie Jun Cui, Thomas Zwick, Benjamin Nuss

    Abstract: Enabling bistatic radar sensing within the context of integrated sensing and communication (ISAC) for future sixth generation mobile networks demands strict synchronization accuracy, which is particularly challenging to achieve with over-the-air synchronization. Existing algorithms handle time and frequency offsets adequately, but provide insufficiently accurate sampling frequency offset (SFO)…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  12. arXiv:2407.07372  [pdf, other]

    eess.IV cs.CV

    Trustworthy Contrast-enhanced Brain MRI Synthesis

    Authors: Jiyao Liu, Yuxin Li, Shangqi Gao, Yuncheng Zhou, Xin Gao, Ningsheng Xu, Xiao-Yong Zhang, Xiahai Zhuang

    Abstract: Contrast-enhanced brain MRI (CE-MRI) is a valuable diagnostic technique but may pose health risks and incur high costs. To create safer alternatives, multi-modality medical image translation aims to synthesize CE-MRI images from other available modalities. Although existing methods can generate promising predictions, they still face two challenges, i.e., exhibiting over-confidence and lacking inte…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: 11 pages, 3 figures

  13. arXiv:2407.07337  [pdf, other]

    cs.NI eess.SP

    In-Orbit Processing or Not? Sunlight-Aware Task Scheduling for Energy-Efficient Space Edge Computing Networks

    Authors: Weisen Liu, Zeqi Lai, Qian Wu, Hewu Li, Qi Zhang, Zonglun Li, Yuanjie Li, Jun Liu

    Abstract: With the rapid evolution of space-borne capabilities, space edge computing (SEC) is becoming a new computation paradigm for future integrated space and terrestrial networks. Satellite edges adopt advanced on-board hardware, which not only enables new opportunities to perform complex intelligent tasks in orbit, but also involves new challenges due to the additional energy consumption in power-const…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: Accepted by IEEE INFOCOM 2024

  14. arXiv:2407.06633  [pdf, other]

    eess.IV cs.CV

    Variational Zero-shot Multispectral Pansharpening

    Authors: Xiangyu Rui, Xiangyong Cao, Yining Li, Deyu Meng

    Abstract: Pansharpening aims to generate a high spatial resolution multispectral image (HRMS) by fusing a low spatial resolution multispectral image (LRMS) and a panchromatic image (PAN). The most challenging issue for this task is that only the to-be-fused LRMS and PAN are available, and the existing deep learning-based methods are unsuitable since they rely on many training pairs. Traditional variational…

    Submitted 9 July, 2024; originally announced July 2024.

  15. arXiv:2407.04736  [pdf, other]

    eess.SP cs.AI cs.LG

    SCDM: Unified Representation Learning for EEG-to-fNIRS Cross-Modal Generation in MI-BCIs

    Authors: Yisheng Li, Shuqiang Wang

    Abstract: Hybrid motor imagery brain-computer interfaces (MI-BCIs), which integrate both electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) signals, outperform those based solely on EEG. However, simultaneously recording EEG and fNIRS signals is highly challenging due to the difficulty of colocating both types of sensors on the same scalp surface. This physical constraint complic…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 11 pages, 5 figures

  16. arXiv:2407.04353  [pdf, other]

    eess.IV cs.CV

    Segmenting Medical Images: From UNet to Res-UNet and nnUNet

    Authors: Lina Huang, Alina Miron, Kate Hone, Yongmin Li

    Abstract: This study provides a comparative analysis of deep learning models including UNet, Res-UNet, Attention Res-UNet, and nnUNet, and evaluates their performance in brain tumour, polyp, and multi-class heart segmentation tasks. The analysis focuses on precision, accuracy, recall, Dice Similarity Coefficient (DSC), and Intersection over Union (IoU) to assess their clinical applicability. In brain tumour…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: 7 pages, 3 figures

  17. arXiv:2407.04174  [pdf, other]

    cs.NI eess.SP

    Gemini: Integrating Full-fledged Sensing upon Millimeter Wave Communications

    Authors: Yilong Li, Zhe Chen, Jun Luo, Suman Banerjee

    Abstract: Integrating millimeter wave (mmWave) technology in both communication and sensing is promising as it enables the reuse of existing spectrum and infrastructure without draining resources. Most existing systems piggyback sensing onto conventional communication modes without fully exploiting the potential of integrated sensing and communication (ISAC) in mmWave radios (not full-fledged). In this paper…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: 12 pages

  18. arXiv:2407.04051  [pdf, other]

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp…

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  19. arXiv:2407.03671  [pdf]

    eess.SY

    Spatio-temporal cooperative control Method of Highway Ramp Merge Based on Vehicle-road Coordination

    Authors: Xiaoxue Xu, Maokai Lai, Haitao Zhang, Xiang Dong, Tao Li, Jie Wu, Yuan Li, Ting Peng

    Abstract: The merging area of highway ramps faces multiple challenges, including traffic congestion, collision risks, speed mismatches, driver behavior uncertainties, limited visibility, and bottleneck effects. However, autonomous vehicles engaging in depth coordination between vehicle and road in merging zones, by pre-planning and uploading travel trajectories, can significantly enhance the safety and effi…

    Submitted 4 July, 2024; originally announced July 2024.

  20. arXiv:2407.03159  [pdf, other]

    cs.SI eess.SY physics.soc-ph

    Protection Degree and Migration in the Stochastic SIRS Model: A Queueing System Perspective

    Authors: Yuhan Li, Ziyan Zeng, Minyu Feng, Jürgen Kurths

    Abstract: With the prevalence of COVID-19, the modeling of epidemic propagation and its analyses have played a significant role in controlling epidemics. However, individual behaviors, in particular the self-protection and migration, which have a strong influence on epidemic propagation, were always neglected in previous studies. In this paper, we mainly propose two models from the individual and population…

    Submitted 3 July, 2024; originally announced July 2024.

  21. arXiv:2407.02251  [pdf, other]

    eess.SP

    White-Box 3D-OMP-Transformer for ISAC

    Authors: Bowen Zhang, Geoffrey Ye Li

    Abstract: Transformers have found broad applications for their great ability to capture long-range dependency among the inputs using attention mechanisms. The recent success of transformers increases the need for mathematical interpretation of their underlying working mechanisms, leading to the development of a family of white-box transformer-like deep network architectures. However, designing white-box tra…

    Submitted 2 July, 2024; originally announced July 2024.

  22. arXiv:2407.02052  [pdf, other]

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for The ICMC-ASR Challenge

    Authors: Minghui Wu, Luzhen Xu, Jie Zhang, Haitao Tang, Yanyan Yue, Ruizhi Liao, Jintao Zhao, Zhengzhe Zhang, Yichi Wang, Haoyin Yan, Hongliang Yu, Tongle Ma, Jiachen Liu, Chongliang Wu, Yongchao Li, Yanyong Zhang, Xin Fang, Yue Zhang

    Abstract: This report describes the submitted system to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which considers the ASR task with multi-speaker overlapping and Mandarin accent dynamics in the ICMC case. We implement the front-end speaker diarization using the self-supervised learning representation based multi-speaker embedding and beamforming using the speaker position,…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted at ICASSP 2024

  23. arXiv:2407.01908  [pdf, other]

    eess.IV cs.CV

    Efficient Stochastic Differential Equation for DEM Super Resolution with Void Filling

    Authors: Tongtong Zhang, Zongcheng Zuo, Yuanxiang Li

    Abstract: Digital Elevation Model (DEM) plays a fundamental role in remote sensing and photogrammetry. Enhancing the quality of DEM is crucial for various applications. Although multiple types of defects may appear simultaneously in the same DEM, they are commonly addressed separately. Most existing approaches only aim to fill the DEM voids, or apply super-resolution to the intact DEM. This paper introduces…

    Submitted 1 July, 2024; originally announced July 2024.

  24. arXiv:2407.00896  [pdf, other]

    eess.SP cs.AI

    Channel Modeling Aided Dataset Generation for AI-Enabled CSI Feedback: Advances, Challenges, and Solutions

    Authors: Yupeng Li, Gang Li, Zirui Wen, Shuangfeng Han, Shijian Gao, Guangyi Liu, Jiangzhou Wang

    Abstract: The AI-enabled autoencoder has demonstrated great potential in channel state information (CSI) feedback in frequency division duplex (FDD) multiple input multiple output (MIMO) systems. However, this method completely changes the existing feedback strategies, making it impractical to deploy in recent years. To address this issue, this paper proposes a channel modeling aided data augmentation metho…

    Submitted 30 June, 2024; originally announced July 2024.

  25. arXiv:2407.00280  [pdf, other]

    eess.IV cs.CV

    IVCA: Inter-Relation-Aware Video Complexity Analyzer

    Authors: Junqi Liao, Yao Li, Zhuoyuan Li, Li Li, Dong Liu

    Abstract: To meet the real-time analysis requirements of video streaming applications, we propose an inter-relation-aware video complexity analyzer (IVCA) as an extension to VCA. The IVCA addresses the limitation of VCA by considering inter-frame relations, namely motion and reference structure. First, we enhance the accuracy of temporal features by introducing feature-domain motion estimation into the IVCA…

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: The report for the solution of second prize winner in ICIP 2024 Grand Challenge on Video Complexity (Team: USTC-iVC_Team1, USTC-iVC_Team2)

  26. arXiv:2406.19483  [pdf, ps, other]

    eess.SP

    Localization in Multipath Environments via Active Sensing with Reconfigurable Intelligent Surfaces

    Authors: Yinghan Li, Wei Yu

    Abstract: This letter investigates an uplink pilot-based wireless indoor localization problem in a multipath environment for a single-input single-output (SISO) narrowband communication system aided by reconfigurable intelligent surface (RIS). The indoor localization problem is challenging because the uplink channel consists of multiple overlapping propagation paths with varying amplitudes and phases, which…

    Submitted 8 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

  27. arXiv:2406.19043  [pdf]

    eess.IV cs.AI cs.CV cs.DB

    CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI

    Authors: Zi Wang, Fanwen Wang, Chen Qin, Jun Lyu, Ouyang Cheng, Shuo Wang, Yan Li, Mengyao Yu, Haoyu Zhang, Kunyuan Guo, Zhang Shi, Qirong Li, Ziqiang Xu, Yajing Zhang, Hao Li, Sha Hua, Binghua Chen, Longyu Sun, Mengting Sun, Qin Li, Ying-Hua Chu, Wenjia Bai, Jing Qin, Xiahai Zhuang, Claudia Prieto, et al. (7 additional authors not shown)

    Abstract: Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover h…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 19 pages, 3 figures, 2 tables

  28. arXiv:2406.18345  [pdf, other]

    cs.LG eess.SP

    EmT: A Novel Transformer for Generalized Cross-subject EEG Emotion Recognition

    Authors: Yi Ding, Chengxuan Tong, Shuailei Zhang, Muyun Jiang, Yong Li, Kevin Lim Jun Liang, Cuntai Guan

    Abstract: Integrating prior knowledge of neurophysiology into neural network architecture enhances the performance of emotion decoding. While numerous techniques emphasize learning spatial and short-term temporal patterns, there has been limited emphasis on capturing the vital long-term contextual information associated with emotional cognitive processes. In order to address this discrepancy, we introduce a…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: 11 pages, 5 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  29. arXiv:2406.18069  [pdf, other]

    eess.SP cs.AI cs.CL

    Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals

    Authors: Zengding Liu, Chen Chen, Jiannong Cao, Minglei Pan, Jikui Liu, Nan Li, Fen Miao, Ye Li

    Abstract: Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood press…

    Submitted 4 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

  30. arXiv:2406.17257  [pdf, other]

    cs.CL cs.SD eess.AS

    Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

    Authors: Yingting Li, Ambuj Mehrish, Bryan Chew, Bo Cheng, Soujanya Poria

    Abstract: Different languages have distinct phonetic systems and vary in their prosodic features making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, TTS architecture needs to be both efficient enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to…

    Submitted 24 June, 2024; originally announced June 2024.

  31. arXiv:2406.14664  [pdf, ps, other]

    eess.SP

    Experimental Validation of Cooperative RSS-based Localization with Unknown Transmit Power, Path Loss Exponent, and Precise Anchor Location

    Authors: Yingquan Li, Bodhibrata Mukhopadhyay, Jiajie Xu, Mohamed-Slim Alouini

    Abstract: Received signal strength (RSS)-based cooperative localization has gained significant attention due to its straightforward system architectures and cost-effectiveness. In this paper, we propose Cooperative Localization Techniques (with Unknown Parameters), referred to as CTUP(s), which consider uncertainty in anchor nodes' locations and assume the transmit power and path loss expo…

    Submitted 20 June, 2024; originally announced June 2024.

  32. arXiv:2406.13705  [pdf, other]

    eess.IV cs.AI cs.CV

    EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy

    Authors: Long Bai, Tong Chen, Qiaozhi Tan, Wan Jun Nah, Yanheng Li, Zhicheng He, Sishen Yuan, Zhen Chen, Jinlin Wu, Mobarakol Islam, Zhen Li, Hongbin Liu, Hongliang Ren

    Abstract: Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels rema…

    Submitted 8 July, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: To appear in MICCAI 2024. Code and dataset availability: https://github.com/longbai1006/EndoUIC

  33. arXiv:2406.12699  [pdf, other]

    cs.SD eess.AS eess.SP

    Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition

    Authors: Kuan-Chen Wang, You-Jin Li, Wei-Lun Chen, Yu-Wen Chen, Yi-Ching Wang, Ping-Cheng Yeh, Chao Zhang, Yu Tsao

    Abstract: Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the use of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study intro…

    Submitted 18 June, 2024; originally announced June 2024.

  34. arXiv:2406.12646  [pdf, other]

    eess.IV cs.AI cs.CV

    An Empirical Study on the Fairness of Foundation Models for Multi-Organ Image Segmentation

    Authors: Qin Li, Yizhe Zhang, Yan Li, Jun Lyu, Meng Liu, Longyu Sun, Mengting Sun, Qirong Li, Wenyue Mao, Xinran Wu, Yajing Zhang, Yinghua Chu, Shuo Wang, Chengyan Wang

    Abstract: The segmentation foundation model, e.g., Segment Anything Model (SAM), has attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance from the perspectives of overall accuracy and efficiency, yet little attention was given to the fairness considerations. This oversight raises questions about the potenti…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted to MICCAI-2024

  35. arXiv:2406.11219  [pdf, other]

    cs.RO eess.SY

    A Swift and Omnidirectional Formation Approach based on Hierarchical Reorganization

    Authors: Yuzhu Li, Wei Dong

    Abstract: Current formations commonly rely on invariant hierarchical structures, such as predetermined leaders or enumerated formation shapes. These structures could be unidirectional and sluggish, constraining their adaptability and agility when encountering cluttered environments. To surmount these constraints, this work proposes an omnidirectional affine formation approach with hierarchical reorganizatio…

    Submitted 17 June, 2024; originally announced June 2024.

  36. arXiv:2406.10298  [pdf]

    eess.SY

    Enhancing Resilience of Power Systems against Typhoon Threats: A Hybrid Data-Model Driven Approach

    Authors: Yang Li

    Abstract: This chapter addresses the increasing vulnerability of coastal regions to typhoons and the consequent power outages, emphasizing the critical role of power transmission systems in disaster resilience. It introduces a framework for assessing and enhancing the resilience of these systems against typhoon impacts. The approach integrates a hybrid-driven model for system failure analysis and resilience…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted in 'Design for Reliability and Resilience of Power Systems,' Elsevier

  37. arXiv:2406.10236  [pdf, other]

    eess.IV cs.AI

    Lightening Anything in Medical Images

    Authors: Ben Fei, Yixuan Li, Weidong Yang, Hengjun Gao, Jingyi Xu, Lipeng Ma, Yatian Yang, Pinghong Zhou

    Abstract: The development of medical imaging techniques has made a significant contribution to clinical decision-making. However, the existence of suboptimal imaging quality, as indicated by irregular illumination or imbalanced intensity, presents significant obstacles in automating disease screening, analysis, and diagnosis. Existing approaches for natural image enhancement are mostly trained with numerous…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 23 pages, 6 figures

  38. arXiv:2406.09695  [pdf, other]

    eess.SP

    Machine learning-based Near-field Emitter Localization via Grouped Hybrid Analog and Digital Massive MIMO Receive Array

    Authors: Yifan Li, Feng Shu, Jiatong Bai, Cunhua Pan, Yongpeng Wu, Yaoliang Song, Jiangzhou Wang

    Abstract: A fully-digital massive MIMO receive array is promising to meet the high-resolution requirement of near-field (NF) emitter localization, but it also significantly increases hardware costs and algorithm complexity. In order to meet the future demand for green communication while maintaining high performance, the grouped hybrid analog and digital (HAD) structure is proposed for NF…

    Submitted 13 June, 2024; originally announced June 2024.

  39. arXiv:2406.09238  [pdf, other]

    cs.IT eess.SP

    Near-Field Multiuser Communications based on Sparse Arrays

    Authors: Kangjian Chen, Chenhao Qi, Geoffrey Ye Li, Octavia A. Dobre

    Abstract: This paper considers near-field multiuser communications based on sparse arrays (SAs). First, for the uniform SAs (USAs), we analyze the beam gains of channel steering vectors, which shows that increasing the antenna spacings can effectively improve the spatial resolution of the antenna arrays to enhance the sum rate of multiuser communications. Then, we investigate nonuniform SAs (NSAs) to mitiga…

    Submitted 13 June, 2024; originally announced June 2024.

  40. arXiv:2406.08393  [pdf, other]

    eess.AS cs.SD

    SCDNet: Self-supervised Learning Feature-based Speaker Change Detection

    Authors: Yue Li, Xinsheng Wang, Li Zhang, Lei Xie

    Abstract: Speaker Change Detection (SCD) is to identify boundaries among speakers in a conversation. Motivated by the success of fine-tuning wav2vec 2.0 models for the SCD task, a further investigation of self-supervised learning (SSL) features for SCD is conducted in this work. Specifically, an SCD model, named SCDNet, is proposed. With this model, various state-of-the-art SSL models, including Hubert, wav…

    Submitted 12 June, 2024; originally announced June 2024.

  41. arXiv:2406.08353  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

    Authors: Yuanchao Li, Peter Bell, Catherine Lai

    Abstract: Text data is commonly utilized as a primary input to enhance Speech Emotion Recognition (SER) performance and reliability. However, the reliance on human-transcribed text in most studies impedes the development of practical SER systems, creating a gap between in-lab research and real-world scenarios where Automatic Speech Recognition (ASR) serves as the text source. Hence, this study benchmarks SE…

    Submitted 12 June, 2024; originally announced June 2024.

  42. arXiv:2406.08122  [pdf]

    eess.AS cs.SD

    Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor

    Authors: Yongjie Si, Yanxiong Li, Jialong Li, Jiaxin Tan, Qianhua He

    Abstract: It's assumed that training data is sufficient in base session of few-shot class-incremental audio classification. However, it's difficult to collect abundant samples for model training in base session in some practical scenarios due to the data scarcity of some classes. This paper explores a new problem of fully few-shot class-incremental audio classification with few training samples in all sessi…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted for publication on Interspeech 2024. 5 pages, 3 figures, 5 tables

  43. arXiv:2406.08119  [pdf]

    eess.AS cs.SD

    Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network

    Authors: Yanxiong Li, Jiaxin Tan, Guoqing Chen, Jialong Li, Yongjie Si, Qianhua He

    Abstract: This work is an improved system that we submitted to task 1 of DCASE2023 challenge. We propose a method of low-complexity acoustic scene classification by a parallel attention-convolution network which consists of four modules, including pre-processing, fusion, global and local contextual information extraction. The proposed network is computationally efficient to capture global and local contextu…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted for publication on Interspeech 2024. 5 pages, 4 figures, 3 tables

  44. arXiv:2406.08112  [pdf, other]

    cs.SD cs.AI eess.AS

    Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

    Authors: Yi Lu, Yuankun Xie, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Zhiyong Wang, Xin Qi, Xuefei Liu, Yongwei Li, Yukun Liu, Xiaopeng Wang, Shuchen Shi

    Abstract: With the proliferation of Large Language Model (LLM) based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is directly generated from discrete neural codecs in an end-to…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024. arXiv admin note: substantial text overlap with arXiv:2405.04880

  45. arXiv:2406.05700  [pdf, other]

    cs.CV eess.IV

    HDMba: Hyperspectral Remote Sensing Imagery Dehazing with State Space Model

    Authors: Hang Fu, Genyun Sun, Yinhe Li, Jinchang Ren, Aizhu Zhang, Cheng Jing, Pedram Ghamisi

    Abstract: Haze contamination in hyperspectral remote sensing images (HSI) can lead to spatial visibility degradation and spectral distortion. Haze in HSI exhibits spatial irregularity and inhomogeneous spectral distribution, with few dehazing networks available. Current CNN and Transformer-based dehazing methods fail to balance global scene recovery, local detail retention, and computational efficiency. Ins…

    Submitted 9 June, 2024; originally announced June 2024.

  46. arXiv:2406.05692  [pdf, other]

    cs.SD cs.AI eess.AS

    SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion

    Authors: Bingsong Bai, Fengping Wang, Yingming Gao, Ya Li

    Abstract: Diffusion-based singing voice conversion (SVC) models have shown better synthesis quality compared to traditional methods. However, in cross-domain SVC scenarios, where there is a significant disparity in pitch between the source and target voice domains, the models tend to generate audios with hoarseness, posing challenges in achieving high-quality vocal outputs. Therefore, in this paper, we prop…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  47. arXiv:2406.05452  [pdf, other]

    eess.SP cs.IT

    Near-Field Channel Estimation for Extremely Large-Scale Terahertz Communications

    Authors: Songjie Yang, Yizhou Peng, Wanting Lyu, Ya Li, Hongjun He, Zhongpei Zhang, Chau Yuen

    Abstract: Future Terahertz communications exhibit significant potential in accommodating ultra-high-rate services. Employing extremely large-scale array antennas is a key approach to realize this potential, as they can harness substantial beamforming gains to overcome the severe path loss and leverage the electromagnetic advantages in the near field. This paper proposes novel estimation methods designed to…

    Submitted 8 June, 2024; originally announced June 2024.

  48. arXiv:2406.05170  [pdf]

    q-bio.OT cs.CV eess.IV

    Research on Tumors Segmentation based on Image Enhancement Method

    Authors: Danyi Huang, Ziang Liu, Yizhou Li

    Abstract: One of the most effective ways to treat liver cancer is to perform precise liver resection surgery, the key step of which includes precise digital image segmentation of the liver and its tumor. However, traditional liver parenchymal segmentation techniques often face several challenges in performing liver segmentation: lack of precision, slow processing speed, and computational burden. These short…

    Submitted 7 June, 2024; originally announced June 2024.

  49. arXiv:2406.04683  [pdf, other]

    cs.SD eess.AS

    PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation

    Authors: Shuchen Shi, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Yi Lu, Xin Qi, Xuefei Liu, Yukun Liu, Yongwei Li, Zhiyong Wang, Xiaopeng Wang

    Abstract: Text-to-Audio (TTA) aims to generate audio that corresponds to the given text description, playing a crucial role in media production. The text descriptions in TTA datasets lack rich variations and diversity, resulting in a drop in TTA model performance when faced with complex text. To address this issue, we propose a method called Portable Plug-in Prompt Refiner, which utilizes rich knowledge abo…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: accepted by INTERSPEECH2024

  50. arXiv:2406.04324  [pdf, other]

    cs.CV eess.IV

    SF-V: Single Forward Video Generation Model

    Authors: Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, Sergey Tulyakov, Jian Ren

    Abstract: Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through the iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs. In this work, we propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune p…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://snap-research.github.io/SF-V