Skip to main content

Showing 1–50 of 1,310 results for author: Jiang, Y

  1. arXiv:2407.03146  [pdf, other

    cs.CY cs.AI cs.CV cs.GT cs.LG

    Enhancing Class Fairness in Classification with A Two-Player Game Approach

    Authors: Yunpeng Jiang, Paul Weng, Yutong Ban

    Abstract: Data augmentation is widely applied and has shown its benefits in different machine learning tasks. However, as recently observed in some downstream tasks, data augmentation may introduce an unfair impact on classifications. While it can improve the performance of some classes, it can actually be detrimental for other classes, which can be problematic in some application domains. In this paper, to… ▽ More

    Submitted 30 May, 2024; originally announced July 2024.

  2. arXiv:2407.02763  [pdf, other

    cs.CV

    ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

    Authors: Yanfeng Jiang, Ning Sun, Xueshuo Xie, Fei Yang, Tao Li

    Abstract: Vision Transformers (ViTs) have exhibited exceptional performance across diverse computer vision tasks, while their substantial parameter size incurs significantly increased memory and computational demands, impeding effective inference on resource-constrained devices. Quantization has emerged as a promising solution to mitigate these challenges, yet existing methods still suffer from significant… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 28 pages,9 figures

  3. arXiv:2407.02530  [pdf, ps, other

    quant-ph cs.DS

    Unifying quantum spatial search, state transfer and uniform sampling on graphs: simple and exact

    Authors: Qingwen Wang, Ying Jiang, Lvzhou Li

    Abstract: This article presents a novel and succinct algorithmic framework via alternating quantum walks, unifying quantum spatial search, state transfer and uniform sampling on a large class of graphs. Using the framework, we can achieve exact uniform sampling over all vertices and perfect state transfer between any two vertices, provided that eigenvalues of Laplacian matrix of the graph are all integers.… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: This manuscript has some overlap with arXiv:2307.16133. More precisely, it is an advanced version of arXiv:2307.16133, which not only modifies the paper structure and some results but also adds several new results

  4. arXiv:2407.01523  [pdf, other

    cs.CV cs.CL

    MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

    Authors: Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun

    Abstract: Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark co… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  5. arXiv:2407.00942  [pdf, other

    cs.IR cs.AI cs.CL

    ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions

    Authors: Jingheng Ye, Yong Jiang, Xiaobin Wang, Yinghui Li, Yangning Li, Hai-Tao Zheng, Pengjun Xie, Fei Huang

    Abstract: This paper introduces the task of product demand clarification within an e-commercial scenario, where the user commences the conversation with ambiguous queries and the task-oriented agent is designed to achieve more accurate and tailored product searching by asking clarification questions. To address this task, we propose ProductAgent, a conversational information seeking agent equipped with abil… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 17 pages, 13 tables, 6 figures. Under review

  6. arXiv:2407.00719  [pdf

    cs.CR cs.DC cs.LG

    A Whole-Process Certifiably Robust Aggregation Method Against Backdoor Attacks in Federated Learning

    Authors: Anqi Zhou, Yezheng Liu, Yidong Chai, Hongyi Zhu, Xinyue Ge, Yuanchun Jiang, Meng Wang

    Abstract: Federated Learning (FL) has garnered widespread adoption across various domains such as finance, healthcare, and cybersecurity. Nonetheless, FL remains under significant threat from backdoor attacks, wherein malicious actors insert triggers into trained models, enabling them to perform certain tasks while still meeting FL's primary objectives. In response, robust aggregation methods have been prop… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 14 pages

  7. arXiv:2407.00717  [pdf, other

    cs.LG cs.AI eess.SY

    Learning System Dynamics without Forgetting

    Authors: Xikun Zhang, Dongjin Song, Yushan Jiang, Yixin Chen, Dacheng Tao

    Abstract: Predicting the trajectories of systems with unknown dynamics (\textit{i.e.} the governing rules) is crucial in various research fields, including physics and biology. This challenge has gathered significant attention from diverse communities. Most existing works focus on learning fixed system dynamics within one single system. However, real-world applications often involve multiple systems with di… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  8. UWBAD: Towards Effective and Imperceptible Jamming Attacks Against UWB Ranging Systems with COTS Chips

    Authors: Yuqiao Yang, Zhongjie Wu, Yongzhao Zhang, Ting Chen, Jun Li, Jie Yang, Wenhao Liu, Xiaosong Zhang, Ruicong Shi, Jingwei Li, Yu Jiang, Zhuo Su

    Abstract: UWB ranging systems have been adopted in many critical and security sensitive applications due to its precise positioning and secure ranging capabilities. We present a practical jamming attack, namely UWBAD, against commercial UWB ranging systems, which exploits the vulnerability of the adoption of the normalized cross-correlation process in UWB ranging and can selectively and quickly block rangin… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security

  9. arXiv:2407.00326  [pdf, other

    cs.DC cs.AI cs.NI

    Teola: Towards End-to-End Optimization of LLM-based Applications

    Authors: Xin Tan, Yimin Jiang, Yitao Yang, Hong Xu

    Abstract: Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling dec… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

  10. arXiv:2407.00315  [pdf, other

    cs.CV

    Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck

    Authors: Yangzhou Jiang, Yinxin Lin, Yaoming Wang, Teng Li, Bilian Ke, Bingbing Ni

    Abstract: Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, intensive human annotation requirement inhibits current methods from achieving industrial level accuracy and robustness. Although current unsupervised pre-training frameworks have achieved success in many image recognition tasks, due to the deep coupling between fa… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 12 pages, 6 figures, 7 tables

  11. arXiv:2406.19964  [pdf, other

    cs.CR

    Secure Outsourced Decryption for HE-based Privacy-preserving Cloud Computing System

    Authors: Xirong Ma, Chuan Li, Yuchang Hu, Yunting Tao, Yali Jiang, Yanbin Li, Fanyu Kong, Chunpeng Ge

    Abstract: The demand for processing vast volumes of data has surged dramatically due to the advancement of machine learning technology. Large-scale data processing necessitates substantial computational resources, prompting individuals and enterprises to turn to cloud services. Accompanying this trend is a growing concern regarding data leakage and misuse. Homomorphic encryption (HE) is one solution for saf… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  12. arXiv:2406.18532  [pdf, other

    cs.CL cs.AI cs.LG

    Symbolic Learning Enables Self-Evolving Agents

    Authors: Wangchunshu Zhou, Yixin Ou, Shengwei Ding, Long Li, Jialong Wu, Tiannan Wang, Jiamin Chen, Shuai Wang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang

    Abstract: The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing "language agents", which are complex large language models (LLMs) pipelines involving both prompting techniques and tool usage methods. While language agents have demonstrated impressive capabilities for many real-world tasks, a fundamental limitation of current language agents research is that the… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Code available at https://github.com/aiwaves-cn/agents

  13. arXiv:2406.18400  [pdf, other

    cs.CL cs.LG stat.ML

    Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

    Authors: Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam

    Abstract: Large Language Models (LLMs) have the capacity to store and recall facts. Through experimentation with open-source models, we observe that this ability to retrieve facts can be easily manipulated by changing contexts, even without altering their factual meanings. These findings highlight that LLMs might behave like an associative memory model where certain tokens in the contexts serve as clues to… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  14. arXiv:2406.18079  [pdf, other

    cs.CV eess.IV

    MFDNet: Multi-Frequency Deflare Network for Efficient Nighttime Flare Removal

    Authors: Yiguo Jiang, Xuhang Chen, Chi-Man Pun, Shuqiang Wang, Wei Feng

    Abstract: When light is scattered or reflected accidentally in the lens, flare artifacts may appear in the captured photos, affecting the photos' visual quality. The main challenge in flare removal is to eliminate various flare artifacts while preserving the original content of the image. To address this challenge, we propose a lightweight Multi-Frequency Deflare Network (MFDNet) based on the Laplacian Pyra… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted by The Visual Computer journal

  15. arXiv:2406.17518  [pdf, other

    cs.AI cs.SI

    Enhancing Explainability of Knowledge Learning Paths: Causal Knowledge Networks

    Authors: Yuang Wei, Yizhou Zhou, Yuan-Hao Jiang, Bo Jiang

    Abstract: A reliable knowledge structure is a prerequisite for building effective adaptive learning systems and intelligent tutoring systems. Pursuing an explainable and trustworthy knowledge structure, we propose a method for constructing causal knowledge networks. This approach leverages Bayesian networks as a foundation and incorporates causal relationship analysis to derive a causal network. Additionall… ▽ More

    Submitted 25 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: 8 pages, 3 figures, Educational Data Mining 2024, Human-Centric eXplainable AI in Education

  16. arXiv:2406.14555  [pdf, other

    cs.CV

    A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

    Authors: Xincheng Shuai, Henghui Ding, Xingjun Ma, Rongcheng Tu, Yu-Gang Jiang, Dacheng Tao

    Abstract: Image editing aims to edit the given synthetic or real image to meet the specific requirements from users. It is widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. Th… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Project Page: https://github.com/xinchengshuai/Awesome-Image-Editing

  17. arXiv:2406.14090  [pdf, other

    cs.AI

    Personalized Music Recommendation with a Heterogeneity-aware Deep Bayesian Network

    Authors: Erkang Jing, Yezheng Liu, Yidong Chai, Shuo Yu, Longshun Liu, Yuanchun Jiang, Yang Wang

    Abstract: Music recommender systems are crucial in music streaming platforms, providing users with music they would enjoy. Recent studies have shown that user emotions can affect users' music mood preferences. However, existing emotion-aware music recommender systems (EMRSs) explicitly or implicitly assume that users' actual emotional states expressed by an identical emotion word are homogeneous. They also… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: 34 pages, 19 figures

  18. arXiv:2406.12752  [pdf, other

    cs.CR cs.CV cs.LG

    Extracting Training Data from Unconditional Diffusion Models

    Authors: Yunhao Chen, Xingjun Ma, Difan Zou, Yu-Gang Jiang

    Abstract: As diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI), the study of their memorization of the raw training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn by memorization. Such an understanding is crucial for identifying potential r… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

  19. arXiv:2406.12429  [pdf, other

    cs.AI

    Adaptive Selection for Homogeneous Tools: An Instantiation in the RAG Scenario

    Authors: Feiteng Mu, Yong Jiang, Liwen Zhang, Chu Liu, Wenjie Li, Pengjun Xie, Fei Huang

    Abstract: Current research on tool learning primarily focuses on selecting the most effective tool from a wide array of options, often overlooking cost-effectiveness, a crucial factor in human problem-solving. In this paper, we address the selection of homogeneous tools by predicting both their performance and the associated cost required to accomplish a given task. We then assign queries to the optimal too… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

  20. arXiv:2406.12360  [pdf, other

    cs.LG

    UrbanLLM: Autonomous Urban Activity Planning and Management with Large Language Models

    Authors: Yue Jiang, Qin Chao, Yile Chen, Xiucheng Li, Shuai Liu, Gao Cong

    Abstract: Location-based services play an critical role in improving the quality of our daily lives. Despite the proliferation of numerous specialized AI models within spatio-temporal context of location-based services, these models struggle to autonomously tackle problems regarding complex urban planing and management. To bridge this gap, we introduce UrbanLLM, a fine-tuned large language model (LLM) desig… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Submitted for EMNLP 2024

  21. arXiv:2406.12282  [pdf, other

    cs.LG

    SAGDFN: A Scalable Adaptive Graph Diffusion Forecasting Network for Multivariate Time Series Forecasting

    Authors: Yue Jiang, Xiucheng Li, Yile Chen, Shuai Liu, Weilong Kong, Antonis F. Lentzakis, Gao Cong

    Abstract: Time series forecasting is essential for our daily activities and precise modeling of the complex correlations and shared patterns among multiple time series is essential for improving forecasting performance. Spatial-Temporal Graph Neural Networks (STGNNs) are widely used in multivariate time series forecasting tasks and have achieved promising performance on multiple real-world datasets for thei… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted at ICDE 2024

  22. arXiv:2406.12009  [pdf, other

    cs.CL

    FinTruthQA: A Benchmark Dataset for Evaluating the Quality of Financial Information Disclosure

    Authors: Ziyue Xu, Peilin Zhou, Xinyu Shi, Jiageng Wu, Yikang Jiang, Bin Ke, Jie Yang

    Abstract: Accurate and transparent financial information disclosure is crucial in the fields of accounting and finance, ensuring market efficiency and investor confidence. Among many information disclosure platforms, the Chinese stock exchanges' investor interactive platform provides a novel and interactive way for listed firms to disclose information of interest to investors through an online question-and-… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  23. arXiv:2406.11781  [pdf, other

    cs.IR

    DiffMM: Multi-Modal Diffusion Model for Recommendation

    Authors: Yangqin Jiang, Lianghao Xia, Wei Wei, Da Luo, Kangyi Lin, Chao Huang

    Abstract: The rise of online multi-modal sharing platforms like TikTok and YouTube has enabled personalized recommender systems to incorporate multiple modalities (such as visual, textual, and acoustic) into user representations. However, addressing the challenge of data sparsity in these systems remains a key issue. To address this limitation, recent research has introduced self-supervised learning techniq… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  24. arXiv:2406.11739  [pdf, other

    cs.CV

    V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results

    Authors: Jiaqi Wang, Yuhang Zang, Pan Zhang, Tao Chu, Yuhang Cao, Zeyi Sun, Ziyu Liu, Xiaoyi Dong, Tong Wu, Dahua Lin, Zeming Chen, Zhi Wang, Lingchen Meng, Wenhao Yao, Jianwei Yang, Sihong Wu, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Peixi Wu, Bosong Chai, Xuan Nie, Longquan Yan, Zeyu Wang, Qifan Zhou , et al. (9 additional authors not shown)

    Abstract: Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of public benchmarks and challenges to advance the field of object detection. Inspired by the success of previous COCO and LVIS Challenges, we organize the V3… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  25. arXiv:2406.11451  [pdf, other

    cs.CV

    MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

    Authors: Yue Jiang, Jiawei Chen, Dingkang Yang, Mingcheng Li, Shunli Wang, Tong Wu, Ke Li, Lihua Zhang

    Abstract: When Large Vision Language Models (LVLMs) are applied to multimodal medical generative tasks, they suffer from significant model hallucination issues. This severely impairs the model's generative accuracy, making it challenging for LVLMs to be implemented in real-world medical scenarios to assist doctors in diagnosis. Enhancing the training data for downstream medical generative tasks is an effect… ▽ More

    Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  26. arXiv:2406.11410  [pdf, other

    cs.CL cs.AI

    HARE: HumAn pRiors, a key to small language model Efficiency

    Authors: Lingyun Zhang, Bin jin, Gaojian Ge, Lunhui Liu, Xuewen Shen, Mingyong Wu, Houqian Zhang, Yongneng Jiang, Shiqi Chen, Shi Pu

    Abstract: Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale… ▽ More

    Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  27. arXiv:2406.11172  [pdf, other

    cs.CL

    Enhancing Criminal Case Matching through Diverse Legal Factors

    Authors: Jie Zhao, Ziyu Guan, Wei Zhao, Yue Jiang

    Abstract: Criminal case matching endeavors to determine the relevance between different criminal cases. Conventional methods predict the relevance solely based on instance-level semantic features and neglect the diverse legal factors (LFs), which are associated with diverse court judgments. Consequently, comprehensively representing a criminal case remains a challenge for these approaches. Moreover, extract… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

  28. arXiv:2406.10928  [pdf, other

    cs.CR cs.AI cs.NI

    Make Your Home Safe: Time-aware Unsupervised User Behavior Anomaly Detection in Smart Homes via Loss-guided Mask

    Authors: Jingyu Xiao, Zhiyao Xu, Qingsong Zou, Qing Li, Dan Zhao, Dong Fang, Ruoyu Li, Wenxin Tang, Kang Li, Xudong Zuo, Penghui Hu, Yong Jiang, Zixuan Weng, Michael R. Lyv

    Abstract: Smart homes, powered by the Internet of Things, offer great convenience but also pose security concerns due to abnormal behaviors, such as improper operations of users and potential attacks from malicious attackers. Several behavior modeling methods have been proposed to identify abnormal behaviors and mitigate potential risks. However, their performance often falls short because they do not effec… ▽ More

    Submitted 18 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: KDD 2024

  29. arXiv:2406.10185  [pdf, other

    cs.CV

    Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

    Authors: Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, Lihua Zhang

    Abstract: Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts where the margin for error is mi… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

  30. arXiv:2406.09905  [pdf, other

    cs.CV cs.GR

    Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

    Authors: Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, Kevin Bailey, David Soriano Fosas, C. Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, Richard Newcombe

    Abstract: We introduce Nymeria - a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body 3D motion ground truth; b) egocentric multimodal recordings from Project Aria devices with RGB, grayscale, eye-tracking cameras, IMUs, magnetometer, barometer, and microphones; and c) an additional "observer" dev… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

  31. arXiv:2406.09873  [pdf, other

    eess.AS cs.AI cs.SD

    Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition

    Authors: Yicong Jiang, Tianzi Wang, Xurong Xie, Juan Liu, Wei Sun, Nan Yan, Hui Chen, Lan Wang, Xunying Liu, Feng Tian

    Abstract: Disordered speech recognition profound implications for improving the quality of life for individuals afflicted with, for example, dysarthria. Dysarthric speech recognition encounters challenges including limited data, substantial dissimilarities between dysarthric and non-dysarthric speakers, and significant speaker variations stemming from the disorder. This paper introduces Perceiver-Prompt, a… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted by interspeech 2024

  32. arXiv:2406.09399  [pdf, other

    cs.CV

    OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

    Authors: Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang

    Abstract: Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled archite… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  33. arXiv:2406.08116  [pdf, other

    cs.CL cs.AI

    Supportiveness-based Knowledge Rewriting for Retrieval-augmented Language Modeling

    Authors: Zile Qiao, Wei Ye, Yong Jiang, Tong Mo, Pengjun Xie, Weiping Li, Fei Huang, Shikun Zhang

    Abstract: Retrieval-augmented language models (RALMs) have recently shown great potential in mitigating the limitations of implicit knowledge in LLMs, such as untimely updating of the latest expertise and unreliable retention of long-tail knowledge. However, since the external knowledge base, as well as the retriever, can not guarantee reliability, potentially leading to the knowledge retrieved not being he… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  34. arXiv:2406.07428  [pdf, other

    cs.GT cs.AI cs.LG

    GemNet: Menu-Based, Strategy-Proof Multi-Bidder Auctions Through Deep Learning

    Authors: Tonghan Wang, Yanchen Jiang, David C. Parkes

    Abstract: Differentiable economics uses deep learning for automated mechanism design. Despite strong progress, it has remained an open problem to learn multi-bidder, general, and fully strategy-proof (SP) auctions. We introduce GEneral Menu-based NETwork (GemNet), which significantly extends the menu-based approach of RochetNet [Dütting et al., 2023] to the multi-bidder setting. The challenge in achieving S… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

  35. arXiv:2406.07236  [pdf, other

    cs.LG

    Let Go of Your Labels with Unsupervised Transfer

    Authors: Artyom Gadetsky, Yulun Jiang, Maria Brbic

    Abstract: Foundation vision-language models have enabled remarkable zero-shot transferability of the pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still necessitates human guidance to define visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that ind… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: ICML 2024 camera-ready

  36. arXiv:2406.07198  [pdf, other

    eess.AS cs.MM

    Target Speech Diarization with Multimodal Prompts

    Authors: Yidi Jiang, Ruijie Tao, Zhengyang Chen, Yanmin Qian, Haizhou Li

    Abstract: Traditional speaker diarization seeks to detect ``who spoke when'' according to speaker characteristics. Extending to target speech diarization, we detect ``when target event occurs'' according to the semantic characteristics of speech. We propose a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexib… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 13 pages, 7 figures

  37. arXiv:2406.07091  [pdf, other

    cs.CV

    AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding

    Authors: Xing Zhang, Jiaxi Gu, Haoyu Zhao, Shicong Wang, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu, Yu-Gang Jiang

    Abstract: Temporal Video Grounding (TVG) aims to localize a moment from an untrimmed video given the language description. Since the annotation of TVG is labor-intensive, TVG under limited supervision has accepted attention in recent years. The great success of vision-language pre-training guides TVG to follow the traditional "pre-training + fine-tuning" paradigm, however, the pre-training process would suf… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Technique Report

  38. arXiv:2406.07085  [pdf, other

    cs.CV

    CAT: Coordinating Anatomical-Textual Prompts for Multi-Organ and Tumor Segmentation

    Authors: Zhongzhen Huang, Yankai Jiang, Rongzhao Zhang, Shaoting Zhang, Xiaofan Zhang

    Abstract: Existing promptable segmentation methods in the medical imaging field primarily consider either textual or visual prompts to segment relevant objects, yet they often fall short when addressing anomalies in medical images, like tumors, which may vary greatly in shape, size, and appearance. Recognizing the complexity of medical scenarios and the limitations of textual or visual prompts, we propose a… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

  39. arXiv:2406.06978  [pdf, other

    cs.CV

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Authors: Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, Yu-Gang Jiang, Jose M. Alvarez

    Abstract: We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment… ▽ More

    Submitted 19 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: The 1st place solution of End-to-end Driving at Scale at the CVPR 2024 Autonomous Grand Challenge

  40. arXiv:2406.06525  [pdf, other

    cs.CV

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Authors: Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan

    Abstract: We introduce LlamaGen, a new family of image generation models that apply original ``next-token prediction'' paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spa… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Codes and models: \url{https://github.com/FoundationVision/LlamaGen}

  41. arXiv:2406.06465  [pdf, other

    cs.CV cs.AI cs.CL cs.LG cs.MM

    AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

    Authors: Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, Yu-Gang Jiang

    Abstract: Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation. Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task. However, they struggle with frame consistency and temporal stability primarily due to the… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

  42. arXiv:2406.06028  [pdf, other

    cs.CV

    ReCon1M:A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery

    Authors: Xian Sun, Qiwei Yan, Chubo Deng, Chenglong Liu, Yi Jiang, Zhongyan Hou, Wanxuan Lu, Fanglong Yao, Xiaoyu Liu, Lingxiang Hao, Hongfeng Yu

    Abstract: Scene Graph Generation (SGG) is a high-level visual understanding and reasoning task aimed at extracting entities (such as objects) and their interrelationships from images. Significant progress has been made in the study of SGG in natural images in recent years, but its exploration in the domain of remote sensing images remains very limited. The complex characteristics of remote sensing images ne… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

  43. arXiv:2406.05681  [pdf, other

    cs.SD eess.AS

    Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

    Authors: Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang

    Abstract: Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timb… ▽ More

    Submitted 11 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, accepted by Interspeech2024

  44. arXiv:2406.05647  [pdf, other

    eess.SP cs.ET

    Sustainable Wireless Networks via Reconfigurable Intelligent Surfaces (RISs): Overview of the ETSI ISG RIS

    Authors: Ruiqi Liu, Shuang Zheng, Qingqing Wu, Yifan Jiang, Nan Zhang, Yuanwei Liu, Marco Di Renzo, and George C. Alexandropoulos

    Abstract: Reconfigurable Intelligent Surfaces (RISs) are a novel form of ultra-low power devices that are capable to increase the communication data rates as well as the cell coverage in a cost- and energy-efficient way. This is attributed to their programmable operation that enables them to dynamically manipulate the wireless propagation environment, a feature that has lately inspired numerous research inv… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: 7 pages, 5 figures, submitted to an IEEE Magazine

  45. arXiv:2406.04578  [pdf, other

    cs.CL

    SC2: Towards Enhancing Content Preservation and Style Consistency in Long Text Style Transfer

    Authors: Jie Zhao, Ziyu Guan, Cai Xu, Wei Zhao, Yue Jiang

    Abstract: Text style transfer (TST) aims to vary the style polarity of text while preserving the semantic content. Although recent advancements have demonstrated remarkable progress in short TST, it remains a relatively straightforward task with limited practical applications. The more comprehensive long TST task presents two challenges: (1) existing methods encounter difficulties in accurately evaluating c… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

  46. arXiv:2406.04466  [pdf, other

    cs.HC

    Dog Heart Rate and Blood Oxygen Metaverse Interaction System

    Authors: Yanhui Jiang, Jin Cao, Chang Yu

    Abstract: This study developed an improved dog heart rate and blood oxygen sensor system using Arduino. Traditional methods face accuracy and reliability issues. Our system integrates advanced computational techniques with hardware-based sensing to enhance measurement precision. An Arduino microcontroller connected to a heart rate and blood oxygen sensor collects raw data, which is preprocessed and filtered… ▽ More

    Submitted 10 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: 7 pages, 6 figures, conference for IEEE metacom accepted (https://ieee-metacom.org/)

  47. arXiv:2406.04465  [pdf, other

    cs.HC

    Rough Set improved Therapy-Based Metaverse Assisting System

    Authors: Jin Cao, Yanhui Jiang, Chang Yu, Feiwei Qin, Zekun Jiang

    Abstract: Chronic neck and shoulder pain (CNSP) is a major global public health issue. Traditional treatments like physiotherapy and rehabilitation have drawbacks, including high costs, low precision, and user discomfort. This paper presents an interactive system based on Cognitive Therapy Theory (CBT) for CNSP treatment. The system includes a pain detection module using EMG and IMU to monitor pain and opti… ▽ More

    Submitted 10 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: 7 pages, 5 figures, conference for IEEE metacom accepted (https://ieee-metacom.org/)

  48. arXiv:2406.04334  [pdf, other

    cs.CV

    DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

    Authors: Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, Yu-Gang Jiang

    Abstract: Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer. This paper presents a new architecture DeepStack for LMMs. Considering $N$ layers in… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://deepstack-vl.github.io/

  49. arXiv:2406.04151  [pdf, other

    cs.AI cs.CL

    AgentGym: Evolving Large Language Model-based Agents across Diverse Environments

    Authors: Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang

    Abstract: Building generalist agents that can handle diverse tasks and evolve themselves across different environments is a long-term goal in the AI community. Large language models (LLMs) are considered a promising foundation to build such agents due to their generalized capabilities. Current approaches either have LLM-based agents imitate expert-provided trajectories step-by-step, requiring human supervis… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project site: https://agentgym.github.io

  50. arXiv:2406.03835  [pdf, other

    cs.CV cs.RO

    Monocular Localization with Semantics Map for Autonomous Vehicles

    Authors: Jixiang Wan, Xudong Zhang, Shuzhou Dong, Yuwei Zhang, Yuchen Yang, Ruoxi Wu, Ye Jiang, Jijunnan Li, Jinquan Lin, Ming Yang

    Abstract: Accurate and robust localization remains a significant challenge for autonomous vehicles. The cost of sensors and limitations in local computational efficiency make it difficult to scale to large commercial applications. Traditional vision-based approaches focus on texture features that are susceptible to changes in lighting, season, perspective, and appearance. Additionally, the large storage siz… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.