Skip to main content

Showing 1–50 of 752 results for author: Shi, J

  1. arXiv:2407.01621  [pdf, other

    cs.LG q-bio.QM stat.ME stat.ML

    Deciphering interventional dynamical causality from non-intervention systems

    Authors: Jifan Shi, Yang Li, Juan Zhao, Siyang Leng, Kazuyuki Aihara, Luonan Chen, Wei Lin

    Abstract: Detecting and quantifying causality is a focal topic in the fields of science, engineering, and interdisciplinary studies. However, causal studies on non-intervention systems attract much attention but remain extremely challenging. To address this challenge, we propose a framework named Interventional Dynamical Causality (IntDC) for such non-intervention systems, along with its computational crite… ▽ More

    Submitted 28 June, 2024; originally announced July 2024.

  2. arXiv:2407.01505  [pdf, other

    cs.CL cs.AI

    Self-Cognition in Large Language Models: An Exploratory Study

    Authors: Dongping Chen, Jiawen Shi, Yao Wan, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun

    Abstract: While Large Language Models (LLMs) have achieved remarkable success across various applications, they also raise concerns regarding self-cognition. In this paper, we perform a pioneering study to explore self-cognition in LLMs. Specifically, we first construct a pool of self-cognition instruction prompts to evaluate where an LLM exhibits self-cognition and four well-designed principles to quantify… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted at ICML 2024 Large Language Models and Cognition Workshop

  3. arXiv:2407.00837  [pdf, other

    cs.CL cs.AI cs.SD eess.AS

    Towards Robust Speech Representation Learning for Thousands of Languages

    Authors: William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 millio… ▽ More

    Submitted 2 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

    Comments: Updated affiliations; 20 pages

  4. arXiv:2406.18995  [pdf, other

    cs.LG cs.AI

    FedMLP: Federated Multi-Label Medical Image Classification under Task Heterogeneity

    Authors: Zhaobin Sun, Nannan Wu, Junjie Shi, Li Yu, Xin Yang, Kwang-Ting Cheng, Zengqiang Yan

    Abstract: Cross-silo federated learning (FL) enables decentralized organizations to collaboratively train models while preserving data privacy and has made significant progress in medical image classification. One common assumption is task homogeneity where each client has access to all classes during training. However, in clinical practice, given a multi-label classification task, constrained by the level… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Early accepted by MICCAI 2024

  5. arXiv:2406.18610  [pdf, other

    cs.CV

    Vox-UDA: Voxel-wise Unsupervised Domain Adaptation for Cryo-Electron Subtomogram Segmentation with Denoised Pseudo Labeling

    Authors: Haoran Li, Xingjian Li, Jiahua Shi, Huaming Chen, Bo Du, Daisuke Kihara, Johan Barthelemy, Jun Shen, Min Xu

    Abstract: Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology facilitating the study of macromolecular structures at near-atomic resolution. Recent volumetric segmentation approaches on cryo-ET images have drawn widespread interest in biological sector. However, existing methods heavily rely on manually labeled data, which requires highly professional skills, thereby hindering the adoption of full… ▽ More

    Submitted 30 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: 11 pages

  6. arXiv:2406.18344  [pdf, other

    cs.CV

    AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space

    Authors: Huzheng Yang, James Gee, Jianbo Shi

    Abstract: We study the intriguing connection between visual data, deep networks, and the brain. Our method creates a universal channel alignment by using brain voxel fMRI response prediction as the training objective. We discover that deep networks, trained with different objectives, share common feature channels across various models. These channels can be clustered into recurring sets, corresponding to di… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  7. arXiv:2406.18198  [pdf, other

    cs.CV

    VDG: Vision-Only Dynamic Gaussian for Driving Simulation

    Authors: Hao Li, Jingfeng Li, Dingwen Zhang, Chenming Wu, Jieqi Shi, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han

    Abstract: Dynamic Gaussian splatting has led to impressive scene reconstruction and image synthesis advances in novel views. Existing methods, however, heavily rely on pre-computed poses and Gaussian initialization by Structure from Motion (SfM) algorithms or expensive sensors. For the first time, this paper addresses this issue by integrating self-supervised VO into our pose-free dynamic Gaussian method (V… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  8. arXiv:2406.16495  [pdf, other

    cs.CL cs.AI

    OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser

    Authors: Jingze Shi, Ting Xie, Bingheng Wu, Chunjun Zheng, Kai Wang

    Abstract: Recent research has shown that combining Mamba with Transformer architecture, which has selective state space and quadratic self-attention mechanism, outperforms using Mamba or Transformer architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of selective state space in handling long-term dependencies of any element in the seq… ▽ More

    Submitted 24 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  9. arXiv:2406.12638  [pdf, other

    cs.CV cs.LG

    Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model

    Authors: Jiang-Xin Shi, Chi Zhang, Tong Wei, Yu-Feng Li

    Abstract: Pre-trained vision-language models like CLIP have shown powerful zero-shot inference ability via image-text matching and prove to be strong few-shot learners in various downstream tasks. However, in real-world scenarios, adapting CLIP to downstream tasks may encounter the following challenges: 1) data may exhibit long-tailed data distributions and might not have abundant samples for all the classe… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted by KDD 2024

  10. arXiv:2406.10981  [pdf, other

    cs.CV

    ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

    Authors: Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao

    Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. But generating temporal consistent long videos is still challenging. A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on last frames of previous clip. However, existing approaches all involve bidirectional computa… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: Code will be available at https://github.com/Dawn-LX/Causal-VideoGen

  11. arXiv:2406.10911  [pdf, other

    cs.SD eess.AS

    SingMOS: An extensive Open-Source Singing Voice Dataset for MOS Prediction

    Authors: Yuxun Tang, Jiatong Shi, Yuning Wu, Qin Jin

    Abstract: In speech generation tasks, human subjective ratings, usually referred to as the opinion score, are considered the "gold standard" for speech quality evaluation, with the mean opinion score (MOS) serving as the primary evaluation metric. Due to the high cost of human annotation, several MOS prediction systems have emerged in the speech domain, demonstrating good performance. These MOS prediction m… ▽ More

    Submitted 20 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

  12. arXiv:2406.09913  [pdf, other

    cs.CV

    OpenECAD: An Efficient Visual Language Model for Computer-Aided Design

    Authors: Zhe Yuan, Jianqi Shi, Yanhong Huang

    Abstract: Computer-aided design (CAD) tools are utilized in the manufacturing industry for modeling everything from cups to spacecraft. These programs are complex to use and typically require years of training and experience to master. Structured and well-constrained 2D sketches and 3D constructions are crucial components of CAD modeling. A well-executed CAD model can be seamlessly integrated into the manuf… ▽ More

    Submitted 22 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

  13. arXiv:2406.09869  [pdf, ps, other

    cs.SD eess.AS

    MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model

    Authors: Jiatong Shi, Xutai Ma, Hirofumi Inaguma, Anna Sun, Shinji Watanabe

    Abstract: Speech discrete representation has proven effective in various downstream applications due to its superior compression rate of the waveform, fast convergence during training, and compatibility with other modalities. Discrete units extracted from self-supervised learning (SSL) models have emerged as a prominent approach for obtaining speech discrete representation. However, while discrete units hav… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech2024

  14. arXiv:2406.08905  [pdf, other

    cs.SD eess.AS

    SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models

    Authors: Yuxun Tang, Yuning Wu, Jiatong Shi, Qin Jin

    Abstract: Discrete representation has shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, the direct application of speech SSL models to singing generation encounters domain gaps between speech and singing. Furthermore, singing generation necessitates a more refined representation th… ▽ More

    Submitted 20 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  15. arXiv:2406.08761  [pdf, other

    cs.SD eess.AS

    VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

    Authors: Yifeng Yu, Jiatong Shi, Yuning Wu, Shinji Watanabe

    Abstract: Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pr… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 4 pages, 2 figures

  16. arXiv:2406.08641  [pdf, ps, other

    cs.SD cs.CL eess.AS

    ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

    Authors: Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee, Shinji Watanabe

    Abstract: ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB~2.0, which is a ne… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  17. arXiv:2406.08416  [pdf, other

    cs.SD eess.AS

    TokSing: Singing Voice Synthesis based on Discrete Tokens

    Authors: Yuning Wu, Chunlei zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

    Abstract: Recent advancements in speech synthesis witness significant benefits by leveraging discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate representations compared to traditional continuous Mel spectrograms. However, when it comes to singing voice synthesis(SVS), achieving higher levels of melody… ▽ More

    Submitted 20 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  18. arXiv:2406.07725  [pdf, ps, other

    cs.SD eess.AS

    The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

    Authors: Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin

    Abstract: Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge,… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: This manuscript has been accepted by Interspeech2024

  19. arXiv:2406.07686  [pdf, other

    cs.CV

    AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

    Authors: Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, Yapeng Tian

    Abstract: Recent Diffusion Transformers (DiTs) have shown impressive capabilities in generating high-quality single-modality content, including images, videos, and audio. However, it is still under-explored whether the transformer-based diffuser can efficiently denoise the Gaussian noises towards superb multimodal content creation. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

  20. arXiv:2406.07571  [pdf, other

    cs.CY

    Supporting Self-Reflection at Scale with Large Language Models: Insights from Randomized Field Experiments in Classrooms

    Authors: Harsh Kumar, Ruiwei Xiao, Benjamin Lawson, Ilya Musabirov, Jiakai Shi, Xinyuan Wang, Huayin Luo, Joseph Jay Williams, Anna Rafferty, John Stamper, Michael Liut

    Abstract: Self-reflection on learning experiences constitutes a fundamental cognitive process, essential for the consolidation of knowledge and the enhancement of learning efficacy. However, traditional methods to facilitate reflection often face challenges in personalization, immediacy of feedback, engagement, and scalability. Integration of Large Language Models (LLMs) into the reflection process could mi… ▽ More

    Submitted 31 May, 2024; originally announced June 2024.

    Comments: Accepted at L@S'24

  21. arXiv:2406.04614  [pdf, ps, other

    cs.CL cs.AI

    LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model

    Authors: Zhi Zhou, Jiang-Xin Shi, Peng-Xiao Song, Xiao-Wen Yang, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li

    Abstract: Large language models (LLMs), including both proprietary and open-source models, have showcased remarkable capabilities in addressing a wide range of downstream tasks. Nonetheless, when it comes to practical Chinese legal tasks, these models fail to meet the actual requirements. Proprietary models do not ensure data privacy for sensitive legal cases, while open-source models demonstrate unsatisfac… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Technical Report

  22. arXiv:2406.04329  [pdf, other

    cs.LG stat.ML

    Simplified and Generalized Masked Diffusion for Discrete Data

    Authors: Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis K. Titsias

    Abstract: Masked (or absorbing) diffusion is actively explored as an alternative to autoregressive models for generative modeling of discrete data. However, existing work in this area has been hindered by unnecessarily complex model formulations and unclear relationships between different perspectives, leading to suboptimal parameterization, training objectives, and ad hoc adjustments to counteract these is… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

  23. arXiv:2406.03805  [pdf, other

    cs.CR

    AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens

    Authors: Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou

    Abstract: Jailbreak attacks in large language models (LLMs) entail inducing the models to generate content that breaches ethical and legal norm through the use of malicious prompts, posing a substantial threat to LLM security. Current strategies for jailbreak attack and defense often focus on optimizing locally within specific algorithmic frameworks, resulting in ineffective optimization and limited scalabi… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: 32 pages, 2 figures

  24. arXiv:2406.02950  [pdf, other

    eess.AS cs.CL cs.SD

    4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

    Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on appl… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing

  25. arXiv:2406.02880  [pdf, other

    cs.CV cs.AI

    Controllable Talking Face Generation by Implicit Facial Keypoints Editing

    Authors: Dong Zhao, Jiaying Shi, Wenjun Li, Shudong Wang, Shenghui Xu, Zhaoming Pan

    Abstract: Audio-driven talking face generation has garnered significant interest within the domain of digital human research. Existing methods are encumbered by intricate model architectures that are intricately dependent on each other, complicating the process of re-editing image or video inputs. In this work, we present ControlTalk, a talking face generation method to control face expression deformation b… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  26. arXiv:2406.02438  [pdf, other

    eess.AS cs.MM cs.SD

    CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

    Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

    Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi… ▽ More

    Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  27. arXiv:2406.01256  [pdf, other

    cs.CV cs.AI

    Augmented Commonsense Knowledge for Remote Object Grounding

    Authors: Bahram Mohammadi, Yicong Hong, Yuankai Qi, Qi Wu, Shirui Pan, Javen Qinfeng Shi

    Abstract: The vision-and-language navigation (VLN) task necessitates an agent to perceive the surroundings, follow natural language instructions, and act in photo-realistic unseen environments. Most of the existing methods employ the entire image or object features to represent navigable viewpoints. However, these representations are insufficient for proper action prediction, especially for the REVERIE task… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

  28. arXiv:2406.00345  [pdf, other

    cs.CV cs.LG

    DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection

    Authors: Zhi Zhou, Ming Yang, Jiang-Xin Shi, Lan-Zhe Guo, Yu-Feng Li

    Abstract: Vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot capabilities for various downstream tasks. Their performance can be further enhanced through few-shot prompt tuning methods. However, current studies evaluate the performance of learned prompts separately on base and new classes. This evaluation lacks practicality for real-world applications since downstream tasks… ▽ More

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: Accepted by ICML 2024. Code is available at: https://wnjxyk.github.io/DeCoOp

  29. arXiv:2405.16746  [pdf, other

    cs.SE

    Ecosystem of Large Language Models for Code

    Authors: Zhou Yang, Jieke Shi, David Lo

    Abstract: The availability of vast amounts of publicly accessible data of source code and the advances in modern language models, coupled with increasing computational resources, have led to a remarkable surge in the development of large language models for code (LLM4Code, for short). The interaction between code datasets and models gives rise to a complex ecosystem characterized by intricate dependencies t… ▽ More

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: Working in progress

  30. arXiv:2405.15124  [pdf, other

    cs.LG cs.AI

    Scaling Law for Time Series Forecasting

    Authors: Jingzhe Shi, Qinwei Ma, Huan Ma, Lei Li

    Abstract: Scaling law that rewards large datasets, complex models and enhanced data granularity has been observed in various fields of deep learning. Yet, studies on time series forecasting have cast doubt on scaling behaviors of deep learning methods for time series forecasting: while more training data improves performance, more capable models do not always outperform less capable models, and longer input… ▽ More

    Submitted 26 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: 20 pages

  31. arXiv:2405.14828  [pdf, other

    cs.CV

    Good Seed Makes a Good Crop: Discovering Secret Seeds in Text-to-Image Diffusion Models

    Authors: Katherine Xu, Lingzhi Zhang, Jianbo Shi

    Abstract: Recent advances in text-to-image (T2I) diffusion models have facilitated creative and photorealistic image synthesis. By varying the random seeds, we can generate various images for a fixed text prompt. Technically, the seed controls the initial noise and, in multi-step diffusion inference, the noise used for reparameterization at intermediate timesteps in the reverse diffusion process. However, t… ▽ More

    Submitted 23 May, 2024; originally announced May 2024.

  32. arXiv:2405.11922  [pdf, other

    cs.SI cs.LG

    Effective Clustering on Large Attributed Bipartite Graphs

    Authors: Renchi Yang, Yidu Wu, Xiaoyang Lin, Qichen Wang, Tsz Nam Chan, Jieming Shi

    Abstract: Attributed bipartite graphs (ABGs) are an expressive data model for describing the interactions between two sets of heterogeneous nodes that are associated with rich attributes, such as customer-product purchase networks and author-paper authorship graphs. Partitioning the target node set in such graphs into k disjoint clusters (referred to as k-ABGC) finds widespread use in various domains, inclu… ▽ More

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: The technical report for the paper was accepted to KDD 2024. 14 pages

  33. arXiv:2405.11727  [pdf, other

    cs.LG

    Highway Graph to Accelerate Reinforcement Learning

    Authors: Zidu Yin, Zhen Zhang, Dong Gong, Stefano V. Albrecht, Javen Q. Shi

    Abstract: Reinforcement Learning (RL) algorithms often suffer from low training efficiency. A strategy to mitigate this issue is to incorporate a model-based planning algorithm, such as Monte Carlo Tree Search (MCTS) or Value Iteration (VI), into the environmental model. The major limitation of VI is the need to iterate over a large tensor. These still lead to intensive computations. We focus on improving t… ▽ More

    Submitted 19 May, 2024; originally announced May 2024.

    Comments: 28 pages, 17 figures, 3 tables, TMLR

  34. arXiv:2405.09308  [pdf, other

    cs.LG cs.AI

    TimeX++: Learning Time-Series Explanations with Information Bottleneck

    Authors: Zichuan Liu, Tianchun Wang, Jimeng Shi, Xu Zheng, Zhuomin Chen, Lei Song, Wenqian Dong, Jayantha Obeysekera, Farhad Shirani, Dongsheng Luo

    Abstract: Explaining deep learning models operating on time series data is crucial in various applications of interest which require interpretable and transparent insights from time series signals. In this work, we investigate this problem from an information theoretic perspective and show that most existing measures of explainability may suffer from trivial solutions and distributional shift issues. To add… ▽ More

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: Accepted by International Conference on Machine Learning (ICML 2024)

  35. arXiv:2405.06914  [pdf, other

    cs.CV

    Non-confusing Generation of Customized Concepts in Diffusion Models

    Authors: Wang Lin, Jingyuan Chen, Jiaxin Shi, Yichen Zhu, Chen Liang, Junzhong Miao, Tao Jin, Zhou Zhao, Fei Wu, Shuicheng Yan, Hanwang Zhang

    Abstract: We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs). It becomes even more pronounced in the generation of customized concepts, due to the scarcity of user-provided concept visual examples. By revisiting the two major stages leading to the success of TGDMs -- 1) contrastive image-language pre-training (CLIP)… ▽ More

    Submitted 11 May, 2024; originally announced May 2024.

  36. arXiv:2405.06545  [pdf, other

    cs.CL cs.LG

    Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval

    Authors: Mengjia Niu, Hao Li, Jie Shi, Hamed Haddadi, Fan Mo

    Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various domains, although their susceptibility to hallucination poses significant challenges for their deployment in critical areas such as healthcare. To address this issue, retrieving relevant facts from knowledge graphs (KGs) is considered a promising method. Existing KG-augmented approaches tend to be resource-intens… ▽ More

    Submitted 10 May, 2024; originally announced May 2024.

    ACM Class: I.2.7; H.3.3

  37. arXiv:2405.05648  [pdf, other

    cs.RO cs.CV

    ASGrasp: Generalizable Transparent Object Reconstruction and Grasping from RGB-D Active Stereo Camera

    Authors: Jun Shi, Yong A, Yixiang Jin, Dingzhe Li, Haoyu Niu, Zhezhu Jin, He Wang

    Abstract: In this paper, we tackle the problem of grasping transparent and specular objects. This issue holds importance, yet it remains unsolved within the field of robotics due to failure of recover their accurate geometry by depth cameras. For the first time, we propose ASGrasp, a 6-DoF grasp detection network that uses an RGB-D active stereo camera. ASGrasp utilizes a two-layer learning-based stereo net… ▽ More

    Submitted 9 May, 2024; originally announced May 2024.

    Comments: IEEE International Conference on Robotics and Automation (ICRA), 2024

  38. arXiv:2405.05613  [pdf, other

    cs.CV

    Robust Pseudo-label Learning with Neighbor Relation for Unsupervised Visible-Infrared Person Re-Identification

    Authors: Xiangbo Yin, Jiangming Shi, Yachao Zhang, Yang Lu, Zhizhong Zhang, Yuan Xie, Yanyun Qu

    Abstract: Unsupervised Visible-Infrared Person Re-identification (USVI-ReID) presents a formidable challenge, which aims to match pedestrian images across visible and infrared modalities without any annotations. Recently, clustered pseudo-label methods have become predominant in USVI-ReID, although the inherent noise in pseudo-labels presents a significant obstacle. Most existing works primarily focus on sh… ▽ More

    Submitted 9 May, 2024; originally announced May 2024.

  39. arXiv:2405.05244  [pdf, other

    eess.AS cs.AI cs.MM cs.SD

    SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

    Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ… ▽ More

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

  40. arXiv:2405.04309  [pdf, other

    cs.CV cs.AI

    Non-rigid Structure-from-Motion: Temporally-smooth Procrustean Alignment and Spatially-variant Deformation Modeling

    Authors: Jiawei Shi, Hui Deng, Yuchao Dai

    Abstract: Even though Non-rigid Structure-from-Motion (NRSfM) has been extensively studied and great progress has been made, there are still key challenges that hinder their broad real-world applications: 1) the inherent motion/rotation ambiguity requires either explicit camera motion recovery with extra constraint or complex Procrustean Alignment; 2) existing low-rank modeling of the global shape can over-… ▽ More

    Submitted 23 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR 2024; V2 adds new experiments

  41. arXiv:2405.03098  [pdf, other

    cs.CL

    FairMonitor: A Dual-framework for Detecting Stereotypes and Biases in Large Language Models

    Authors: Yanhong Bai, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xingjiao Wu, Liang He

    Abstract: Detecting stereotypes and biases in Large Language Models (LLMs) is crucial for enhancing fairness and reducing adverse impacts on individuals or groups when these models are applied. Traditional methods, which rely on embedding spaces or are based on probability metrics, fall short in revealing the nuanced and implicit biases present in various contexts. To address this challenge, we propose the… ▽ More

    Submitted 5 May, 2024; originally announced May 2024.

  42. SR4ZCT: Self-supervised Through-plane Resolution Enhancement for CT Images with Arbitrary Resolution and Overlap

    Authors: Jiayang Shi, Daniel M. Pelt, K. Joost Batenburg

    Abstract: Computed tomography (CT) is a widely used non-invasive medical imaging technique for disease diagnosis. The diagnostic accuracy is often affected by image resolution, which can be insufficient in practice. For medical CT images, the through-plane resolution is often worse than the in-plane resolution and there can be overlap between slices, causing difficulties in diagnoses. Self-supervised method… ▽ More

    Submitted 3 May, 2024; originally announced May 2024.

    Comments: MLMI2023

  43. arXiv:2405.02509  [pdf, other

    cs.CV cs.AI

    Implicit Neural Representations for Robust Joint Sparse-View CT Reconstruction

    Authors: Jiayang Shi, Junyi Zhu, Daniel M. Pelt, K. Joost Batenburg, Matthew B. Blaschko

    Abstract: Computed Tomography (CT) is pivotal in industrial quality control and medical diagnostics. Sparse-view CT, offering reduced ionizing radiation, faces challenges due to its under-sampled nature, leading to ill-posed reconstruction problems. Recent advancements in Implicit Neural Representations (INRs) have shown promise in addressing sparse-view CT reconstruction. Recognizing that CT often involves… ▽ More

    Submitted 3 May, 2024; originally announced May 2024.

  44. arXiv:2405.01927  [pdf, other

    cs.LG

    SlotGAT: Slot-based Message Passing for Heterogeneous Graph Neural Network

    Authors: Ziang Zhou, Jieming Shi, Renchi Yang, Yuanhang Zou, Qing Li

    Abstract: Heterogeneous graphs are ubiquitous to model complex data. There are urgent needs on powerful heterogeneous graph neural networks to effectively support important applications. We identify a potential semantic mixing issue in existing message passing processes, where the representations of the neighbors of a node $v$ are forced to be transformed to the feature space of $v$ for aggregation, though… ▽ More

    Submitted 3 May, 2024; originally announced May 2024.

    Comments: Published as a conference paper at ICML 2023

  45. arXiv:2405.00753  [pdf, other

    q-bio.QM cs.AI

    HMAMP: Hypervolume-Driven Multi-Objective Antimicrobial Peptides Design

    Authors: Li Wang, Yiping Li, Xiangzheng Fu, Xiucai Ye, Junfeng Shi, Gary G. Yen, Xiangxiang Zeng

    Abstract: Antimicrobial peptides (AMPs) have exhibited unprecedented potential as biomaterials in combating multidrug-resistant bacteria. Despite the increasing adoption of artificial intelligence for novel AMP design, challenges pertaining to conflicting attributes such as activity, hemolysis, and toxicity have significantly impeded the progress of researchers. This paper introduces a paradigm shift by con… ▽ More

    Submitted 1 May, 2024; originally announced May 2024.

  46. arXiv:2405.00452  [pdf, other

    cs.CV

    Predictive Accuracy-Based Active Learning for Medical Image Segmentation

    Authors: Jun Shi, Shulan Ruan, Ziqi Zhu, Minfan Zhao, Hong An, Xudong Xue, Bing Yan

    Abstract: Active learning is considered a viable solution to alleviate the contradiction between the high dependency of deep learning-based segmentation methods on annotated data and the expensive pixel-level annotation cost of medical images. However, most existing methods suffer from unreliable uncertainty assessment and the struggle to balance diversity and informativeness, leading to poor performance in… ▽ More

    Submitted 29 June, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: 9 pages, 4 figures

  47. arXiv:2404.18948  [pdf, other

    cs.LG

    Sub-Adjacent Transformer: Improving Time Series Anomaly Detection with Reconstruction Error from Sub-Adjacent Neighborhoods

    Authors: Wenzhen Yue, Xianghua Ying, Ruohao Guo, DongDong Chen, Ji Shi, Bowei Xing, Yuqing Zhu, Taiyan Chen

    Abstract: In this paper, we present the Sub-Adjacent Transformer with a novel attention mechanism for unsupervised time series anomaly detection. Unlike previous approaches that rely on all the points within some neighborhood for time point reconstruction, our method restricts the attention to regions not immediately adjacent to the target points, termed sub-adjacent neighborhoods. Our key observation is th… ▽ More

    Submitted 27 April, 2024; originally announced April 2024.

    Comments: IJCAI 2024

  48. arXiv:2404.18201  [pdf, other

    cs.RO

    What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

    Authors: Dingzhe Li, Yixiang Jin, Yong A, Hongze Yu, Jun Shi, Xiaoshuai Hao, Peng Hao, Huaping Liu, Fuchun Sun, Bin Fang

    Abstract: The realization of universal robots is an ultimate goal of researchers. However, a key hurdle in achieving this goal lies in the robots' ability to manipulate objects in their unstructured surrounding environments according to different tasks. The learning-based approach is considered an effective way to address generalization. The impressive performance of foundation models in the fields of compu… ▽ More

    Submitted 28 April, 2024; originally announced April 2024.

  49. arXiv:2404.16366  [pdf, other

    cs.LG cs.AI

    Guarding Graph Neural Networks for Unsupervised Graph Anomaly Detection

    Authors: Yuanchen Bei, Sheng Zhou, Jinke Shi, Yao Ma, Haishuai Wang, Jiajun Bu

    Abstract: Unsupervised graph anomaly detection aims at identifying rare patterns that deviate from the majority in a graph without the aid of labels, which is important for a variety of real-world applications. Recent advances have utilized Graph Neural Networks (GNNs) to learn effective node representations by aggregating information from neighborhoods. This is motivated by the hypothesis that nodes in the… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: 14 pages, 9 figures

  50. arXiv:2404.14715  [pdf, other

    cs.CV cs.CL

    FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

    Authors: Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo

    Abstract: Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite the impressive ability to perform complex reasoning for VLMs, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To add… ▽ More

    Submitted 22 April, 2024; originally announced April 2024.