Showing 1–50 of 3,381 results for author: Zhang, L

  1. arXiv:2407.03008  [pdf, other

    cs.CV

    Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering

    Authors: Zhaohe Liao, Jiangtong Li, Li Niu, Liqing Zhang

    Abstract: Despite the recent progress made in Video Question-Answering (VideoQA), these methods typically function as black-boxes, making it difficult to understand their reasoning processes and perform consistent compositional reasoning. To address these challenges, we propose a \textit{model-agnostic} Video Alignment and Answer Aggregation (VA$^{3}$) framework, which is capable of enhancing both compositi… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: 10 pages, CVPR

    Journal ref: CVPR (2024) 13395-13404

  2. arXiv:2407.02505  [pdf, other

    cs.CE cs.LG physics.flu-dyn

    A MgNO Method for Multiphase Flow in Porous Media

    Authors: Xinliang Liu, Xia Yang, Chen-Song Zhang, Lian Zhang, Li Zhao

    Abstract: This research investigates the application of the Multigrid Neural Operator (MgNO), a neural operator architecture inspired by multigrid methods, to the simulation of multiphase flow within porous media. The architecture is adjusted to manage a variety of crucial factors, such as permeability and porosity heterogeneity. The study extends MgNO to time-dependent porous media flow problems and validate… ▽ More

    Submitted 16 June, 2024; originally announced July 2024.

  3. arXiv:2407.02392  [pdf, other

    cs.CV

    TokenPacker: Efficient Visual Projector for Multimodal LLM

    Authors: Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jianke Zhu, Lei Zhang

    Abstract: The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation. However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significa… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 16 pages, Code: https://github.com/CircleRadon/TokenPacker

  4. arXiv:2407.02049  [pdf, other

    eess.AS cs.CL cs.SD

    Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

    Authors: Ruiqi Li, Zhiqing Hong, Yongqi Wang, Lichao Zhang, Rongjie Huang, Siqi Zheng, Zhou Zhao

    Abstract: Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achie… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Work in progress

  5. arXiv:2407.02040  [pdf, other

    cs.CV cs.AI cs.MM

    ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

    Authors: Zhiyuan Ma, Yuxiang Wei, Yabin Zhang, Xiangyu Zhu, Zhen Lei, Lei Zhang

    Abstract: By leveraging the text-to-image diffusion priors, score distillation can synthesize 3D contents without paired text-3D training data. Instead of spending hours of online optimization per text prompt, recent studies have been focused on learning a text-to-3D generative network for amortizing multiple text-3D relations, which can synthesize 3D contents in seconds. However, existing score distillatio… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024. Code available at https://github.com/theEricMa/ScaleDreamer

  6. arXiv:2407.02031  [pdf, other

    cs.DC cs.AI cs.LG

    SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

    Authors: Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang

    Abstract: This paper documents our characterization study and practices for serving text-to-image requests with stable diffusion models in production. We first comprehensively analyze inference request traces for commercial text-to-image applications. It commences with our observation that add-on modules, i.e., ControlNets and LoRAs, that augment the base stable diffusion models, are ubiquitous in generatin… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  7. arXiv:2407.01928  [pdf, other

    cs.CV

    SymPoint Revolutionized: Boosting Panoptic Symbol Spotting with Layer Feature Enhancement

    Authors: Wenlong Liu, Tianyu Yang, Qizhi Yu, Lei Zhang

    Abstract: SymPoint is an initial attempt that utilizes point set representation to solve the panoptic symbol spotting task on CAD drawing. Despite its considerable success, it overlooks graphical layer information and suffers from prohibitively slow training convergence. To tackle this issue, we introduce SymPoint-V2, a robust and efficient solution featuring novel, streamlined designs that overcome these l… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: code at https://github.com/nicehuster/SymPointV2

  8. arXiv:2407.01749  [pdf, other

    cs.LG cs.AI

    Invariant Correlation of Representation with Label

    Authors: Gaojie Jin, Ronghui Mu, Xinping Yi, Xiaowei Huang, Lijun Zhang

    Abstract: The Invariant Risk Minimization (IRM) approach aims to address the challenge of domain generalization by training a feature representation that remains invariant across multiple environments. However, in noisy environments, IRM-related techniques such as IRMv1 and VREx may be unable to achieve the optimal IRM solution, primarily due to erroneous optimization directions. To address this issue, we i… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  9. arXiv:2407.01731  [pdf, other

    cs.CV

    Uncertainty Quantification in Table Structure Recognition

    Authors: Kehinde Ajayi, Leizhen Zhang, Yi He, Jian Wu

    Abstract: Quantifying uncertainties for machine learning models is a critical step to reduce human verification effort by detecting predictions with low confidence. This paper proposes a method for uncertainty quantification (UQ) of table structure recognition (TSR). The proposed UQ method is built upon a mixture-of-expert approach termed Test-Time Augmentation (TTA). Our key idea is to enrich and diversify… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 7 Figures

  10. arXiv:2407.01636  [pdf, other

    cs.CV

    Learning Frequency-Aware Dynamic Transformers for All-In-One Image Restoration

    Authors: Zenglin Shi, Tong Su, Pei Liu, Yunpeng Wu, Le Zhang, Meng Wang

    Abstract: This work aims to tackle the all-in-one image restoration task, which seeks to handle multiple types of degradation with a single model. The primary challenge is to extract degradation representations from the input degraded images and use them to guide the model's adaptation to specific degradation types. Recognizing that various degradations affect image content differently across frequency band… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 8 pages

  11. arXiv:2407.01489  [pdf, other

    cs.SE cs.AI cs.CL cs.LG

    Agentless: Demystifying LLM-based Software Engineering Agents

    Authors: Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang

    Abstract: Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run c… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  12. arXiv:2407.01303  [pdf, other

    cs.RO

    RoDyn-SLAM: Robust Dynamic Dense RGB-D SLAM with Neural Radiance Fields

    Authors: Haochen Jiang, Yueming Xu, Kejie Li, Jianfeng Feng, Li Zhang

    Abstract: Leveraging neural implicit representation to conduct dense RGB-D SLAM has been studied in recent years. However, this approach relies on a static environment assumption and does not work robustly within a dynamic environment due to the inconsistent observation of geometry and photometry. To address the challenges presented in dynamic environments, we propose a novel dynamic SLAM framework with neu… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: IEEE RAL 2024

  13. arXiv:2407.01251  [pdf, other

    cs.CR cs.AI

    QUEEN: Query Unlearning against Model Extraction

    Authors: Huajie Chen, Tianqing Zhu, Lefeng Zhang, Bo Liu, Derui Wang, Wanlei Zhou, Minhui Xue

    Abstract: Model extraction attacks currently pose a non-negligible threat to the security and privacy of deep learning models. By querying the model with a small dataset and using the query results as the ground-truth labels, an adversary can steal a piracy model with performance comparable to the original model. Two key issues that cause the threat are, on the one hand, accurate and unlimited queries can be… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  14. arXiv:2407.00658  [pdf, other

    cs.RO

    A Fast Online Omnidirectional Quadrupedal Jumping Framework Via Virtual-Model Control and Minimum Jerk Trajectory Generation

    Authors: Linzhu Yue, Lingwei Zhang, Zhitao Song, Hongbo Zhang, Jinhu Dong, Xuanqi Zeng, Yun-Hui Liu

    Abstract: Exploring the limits of quadruped robot agility, particularly in the context of rapid and real-time planning and execution of omnidirectional jump trajectories, presents significant challenges due to the complex dynamics involved, especially when considering significant impulse contacts. This paper introduces a new framework to enable fast, omnidirectional jumping capabilities for quadruped robots… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: IROS 2024 paper, 7 pages, 8 figures

    MSC Class: 68T40 ACM Class: I.2.9

  15. arXiv:2407.00132  [pdf, other

    cs.SE cs.AI

    ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents

    Authors: Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

    Abstract: Recent advancements in integrating large language models (LLMs) with application programming interfaces (APIs) have gained significant interest in both academia and industry. These API-based agents, leveraging the strong autonomy and planning capabilities of LLMs, can efficiently solve problems requiring multi-step actions. However, their ability to handle multi-dimensional difficulty levels, dive… ▽ More

    Submitted 28 June, 2024; originally announced July 2024.

  16. arXiv:2407.00073  [pdf, other

    cs.CR

    Provably Secure Non-interactive Key Exchange Protocol for Group-Oriented Applications in Scenarios with Low-Quality Networks

    Authors: Rui Zhang, Lei Zhang

    Abstract: Non-interactive key exchange (NIKE) enables two or multiple parties (just knowing the public system parameters and each other's public key) to derive a (group) session key without the need for interaction. Recently, NIKE in multi-party settings has attracted growing attention. However, we note that most existing multi-party NIKE protocols, underlying costly cryptographic techniques (i.e., multilinear… ▽ More

    Submitted 21 June, 2024; originally announced July 2024.

  17. arXiv:2406.19853  [pdf, other

    cs.CL cs.AI

    YuLan: An Open-source Large Language Model

    Authors: Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang, Shufang Xie, Wayne Xin Zhao, Zhicheng Dou, et al. (13 additional authors not shown)

    Abstract: Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billi… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  18. arXiv:2406.19485  [pdf, other

    eess.IV cs.CV

    GAPNet: Granularity Attention Network with Anatomy-Prior-Constraint for Carotid Artery Segmentation

    Authors: Lin Zhang, Chenggang Lu, Xin-yang Shi, Caifeng Shan, Jiong Zhang, Da Chen, Laurent D. Cohen

    Abstract: Atherosclerosis is a chronic, progressive disease that primarily affects the arterial walls. It is one of the major causes of cardiovascular disease. Magnetic Resonance (MR) black-blood vessel wall imaging (BB-VWI) offers crucial insights into vascular disease diagnosis by clearly visualizing vascular structures. However, the complex anatomy of the neck poses challenges in distinguishing the carot… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  19. arXiv:2406.18603  [pdf, other

    stat.AP cs.LG

    Confidence interval estimation of mixed oil length with conditional diffusion model

    Authors: Yanfeng Yang, Lihong Zhang, Ziqi Chen, Miaomiao Yu, Lei Chen

    Abstract: Accurately estimating the mixed oil length plays a big role in the economic benefit for oil pipeline network. While various proposed methods have tried to predict the mixed oil length, they often exhibit an extremely high probability (around 50\%) of underestimating it. This is attributed to their failure to consider the statistical variability inherent in the estimated length of mixed oil. To add… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  20. arXiv:2406.18550  [pdf, other

    cs.CV cs.AI

    Pre-Trained Vision-Language Models as Partial Annotators

    Authors: Qian-Wei Wang, Yuqiu Xie, Letian Zhang, Zimo Liu, Shu-Tao Xia

    Abstract: Pre-trained vision-language models learn massive data to model unified representations of images and natural languages, which can be widely applied to downstream machine learning tasks. In addition to zero-shot inference, in order to better adapt pre-trained models to the requirements of downstream tasks, people usually use methods such as few-shot or parameter-efficient fine-tuning and knowledge… ▽ More

    Submitted 23 May, 2024; originally announced June 2024.

  21. arXiv:2406.18294  [pdf, other

    cs.CL

    Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs

    Authors: Lei Zhang, Yunshui Li, Jiaming Li, Xiaobo Xia, Jiaxi Yang, Run Luo, Minzheng Wang, Longze Chen, Junhao Liu, Min Yang

    Abstract: Some recently developed code large language models (Code LLMs) have been pre-trained on repository-level code data (Repo-Code LLMs), enabling these models to recognize repository structures and utilize cross-file information for code completion. However, in real-world development scenarios, simply concatenating the entire code repository often exceeds the context window limits of these Repo-Code L… ▽ More

    Submitted 27 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

  22. arXiv:2406.18199  [pdf, other

    cs.CV

    GS-Octree: Octree-based 3D Gaussian Splatting for Robust Object-level 3D Reconstruction Under Strong Lighting

    Authors: Jiaze Li, Zhengyu Wen, Luo Zhang, Jiangbei Hu, Fei Hou, Zhebin Zhang, Ying He

    Abstract: The 3D Gaussian Splatting technique has significantly advanced the construction of radiance fields from multi-view images, enabling real-time rendering. While point-based rasterization effectively reduces computational demands for rendering, it often struggles to accurately reconstruct the geometry of the target object, especially under strong lighting. To address this challenge, we introduce a no… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  23. arXiv:2406.18045  [pdf, other

    cs.CL cs.AI

    PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

    Authors: Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, Chaobo Xu, Ran Hu, Licong Xu, Qijun Cai, Haoran Hua, Jing Sun, Jin Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yufu Wang, Lin Tie, Chaochao Wang, et al. (11 additional authors not shown)

    Abstract: Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision, areas where general pu… ▽ More

    Submitted 3 July, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

  24. arXiv:2406.18035  [pdf, other

    cs.LG stat.ML

    Local Linear Recovery Guarantee of Deep Neural Networks at Overparameterization

    Authors: Yaoyu Zhang, Leyang Zhang, Zhongwang Zhang, Zhiwei Bai

    Abstract: Determining whether deep neural network (DNN) models can reliably recover target functions at overparameterization is a critical yet complex issue in the theory of deep learning. To advance understanding in this area, we introduce a concept we term "local linear recovery" (LLR), a weaker form of target function recovery that renders the problem more amenable to theoretical analysis. In the sense o… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2211.11623

  25. arXiv:2406.17998  [pdf, other

    cs.CV

    Changen2: Multi-Temporal Remote Sensing Generative Change Foundation Model

    Authors: Zhuo Zheng, Stefano Ermon, Dongjun Kim, Liangpei Zhang, Yanfei Zhong

    Abstract: Our understanding of the temporal dynamics of the Earth's surface has been advanced by deep vision models, which often require lots of labeled multi-temporal images for training. However, collecting, preprocessing, and annotating multi-temporal remote sensing images at scale is non-trivial since it is expensive and knowledge-intensive. In this paper, we present change data generators based on gene… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: The enhanced extension of our ICCV 2023 (Changen)

  26. arXiv:2406.17431  [pdf, other

    cs.SE

    A Large-scale Investigation of Semantically Incompatible APIs behind Compatibility Issues in Android Apps

    Authors: Shidong Pan, Tianchen Guo, Lihong Zhang, Pei Liu, Zhenchang Xing, Xiaoyu Sun

    Abstract: Application Programming Interface (API) incompatibility is a long-standing issue in Android application development. The rapid evolution of Android APIs results in a significant number of API additions, removals, and changes between adjacent versions. Unfortunately, this high frequency of alterations may lead to compatibility issues, often without adequate notification to developers regarding thes… ▽ More

    Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

  27. arXiv:2406.17419  [pdf, other

    cs.CL cs.AI

    Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

    Authors: Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li

    Abstract: Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-contex… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: We release our code and data publicly at https://github.com/MozerWang/Loong

  28. arXiv:2406.17396  [pdf, other

    cs.CV

    SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing

    Authors: Ruihuang Li, Liyi Chen, Zhengqiang Zhang, Varun Jampani, Vishal M. Patel, Lei Zhang

    Abstract: Text-based 2D diffusion models have demonstrated impressive capabilities in image generation and editing. Meanwhile, the 2D diffusion models also exhibit substantial potentials for 3D editing tasks. However, how to achieve consistent edits across multiple viewpoints remains a challenge. While the iterative dataset update method is capable of achieving global consistency, it suffers from slow conve… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 16 pages, 13 figures

  29. arXiv:2406.16872  [pdf, other

    eess.SP cs.AI

    Multi-channel Time Series Decomposition Network For Generalizable Sensor-Based Activity Recognition

    Authors: Jianguo Pan, Zhengxin Hu, Lingdun Zhang, Xia Cai

    Abstract: Sensor-based human activity recognition is important in daily scenarios such as smart healthcare and homes due to its non-intrusive privacy and low cost advantages, but the problem of out-of-domain generalization caused by differences in focusing individuals and operating environments can lead to significant accuracy degradation on cross-person behavior recognition due to the inconsistent distribu… ▽ More

    Submitted 28 March, 2024; originally announced June 2024.

  30. CausalMMM: Learning Causal Structure for Marketing Mix Modeling

    Authors: Chang Gong, Di Yao, Lei Zhang, Sheng Chen, Wenbin Li, Yueyang Su, Jingping Bi

    Abstract: In online advertising, marketing mix modeling (MMM) is employed to predict the gross merchandise volume (GMV) of brand shops and help decision-makers to adjust the budget allocation of various advertising channels. Traditional MMM methods leveraging regression techniques can fail in handling the complexity of marketing. Although some efforts try to encode the causal structures for better predictio… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: WSDM 2024, full version

  31. arXiv:2406.16722  [pdf, other

    cs.CL

    Venturing into Uncharted Waters: The Navigation Compass from Transformer to Mamba

    Authors: Yuchen Zou, Yineng Chen, Zuchao Li, Lefei Zhang, Hai Zhao

    Abstract: Transformer, a deep neural network architecture, has long dominated the field of natural language processing and beyond. Nevertheless, the recent introduction of Mamba challenges its supremacy, sparks considerable interest among researchers, and gives rise to a series of Mamba-based models that have exhibited notable potential. This survey paper orchestrates a comprehensive discussion, diving into… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

  32. arXiv:2406.16620  [pdf, other

    cs.CV cs.CL

    OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer

    Authors: Lu Zhang, Tiancheng Zhao, Heting Ying, Yibo Ma, Kyusong Lee

    Abstract: Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, ofte… ▽ More

    Submitted 24 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  33. arXiv:2406.16529  [pdf, other

    cs.CL

    Towards Better Graph-based Cross-document Relation Extraction via Non-bridge Entity Enhancement and Prediction Debiasing

    Authors: Hao Yue, Shaopeng Lai, Chengyi Yang, Liang Zhang, Junfeng Yao, Jinsong Su

    Abstract: Cross-document Relation Extraction aims to predict the relation between target entities located in different documents. In this regard, the dominant models commonly retain useful information for relation prediction via bridge entities, which allows the model to elaborately capture the intrinsic interdependence between target entities. However, these studies ignore the non-bridge entities, each of… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 Findings

  34. arXiv:2406.16221  [pdf, other

    cs.LG cs.AI cs.GR econ.EM stat.ME

    F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data

    Authors: Zexing Xu, Linjun Zhang, Sitan Yang, Rasoul Etesami, Hanghang Tong, Huan Zhang, Jiawei Han

    Abstract: Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stake sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns f… ▽ More

    Submitted 23 June, 2024; originally announced June 2024.

    MSC Class: 68T07; 68T05; 62M10; 62M20; 90C90; 91B84

  35. arXiv:2406.15910  [pdf, other

    cs.CV

    Soft Masked Mamba Diffusion Model for CT to MRI Conversion

    Authors: Zhenbin Wang, Lei Zhang, Lituan Wang, Zhenwei Zhang

    Abstract: Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) are the predominant modalities utilized in the field of medical imaging. Although MRI captures the complexity of anatomical structures in greater detail than CT, it entails higher financial costs and requires longer image acquisition times. In this study, we aim to train a latent diffusion model for CT to MRI conversion, replacing the… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

  36. Single-Temporal Supervised Learning for Universal Remote Sensing Change Detection

    Authors: Zhuo Zheng, Yanfei Zhong, Ailong Ma, Liangpei Zhang

    Abstract: Bitemporal supervised learning paradigm always dominates remote sensing change detection using numerous labeled bitemporal image pairs, especially for high spatial resolution (HSR) remote sensing imagery. However, it is very expensive and labor-intensive to label change regions in large-scale bitemporal HSR remote sensing image pairs. In this paper, we propose single-temporal supervised learning (… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: IJCV 2024. arXiv admin note: text overlap with arXiv:2108.07002

  37. arXiv:2406.15093  [pdf, other

    cs.CR cs.CV eess.IV

    ECLIPSE: Expunging Clean-label Indiscriminate Poisons via Sparse Diffusion Purification

    Authors: Xianlong Wang, Shengshan Hu, Yechao Zhang, Ziqi Zhou, Leo Yu Zhang, Peng Xu, Wei Wan, Hai Jin

    Abstract: Clean-label indiscriminate poisoning attacks add invisible perturbations to correctly labeled training images, thus dramatically reducing the generalization capability of the victim models. Recently, some defense mechanisms have been proposed such as adversarial training, image transformation techniques, and image purification. However, these schemes are either susceptible to adaptive attacks, bui… ▽ More

    Submitted 24 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

    Comments: Accepted by ESORICS 2024

  38. arXiv:2406.15000  [pdf, other

    cs.CL cs.AI

    Unveiling the Impact of Multi-Modal Interactions on User Engagement: A Comprehensive Evaluation in AI-driven Conversations

    Authors: Lichao Zhang, Jia Yu, Shuai Zhang, Long Li, Yangyang Zhong, Guanbao Liang, Yuming Yan, Qing Ma, Fangsheng Weng, Fayu Pan, Jing Li, Renjun Xu, Zhenzhong Lan

    Abstract: Large Language Models (LLMs) have significantly advanced user-bot interactions, enabling more complex and coherent dialogues. However, the prevalent text-only modality might not fully exploit the potential for effective user engagement. This paper explores the impact of multi-modal interactions, which incorporate images and audio alongside text, on user engagement in chatbot conversations. We cond… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

  39. arXiv:2406.14969  [pdf, other

    cs.LG cs.AI

    Uni-Mol2: Exploring Molecular Pretraining Model at Scale

    Authors: Xiaohong Ji, Zhen Wang, Zhifeng Gao, Hang Zheng, Linfeng Zhang, Guolin Ke, Weinan E

    Abstract: In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, research exploring scaling law in molecular pretraining mo… ▽ More

    Submitted 1 July, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

  40. arXiv:2406.14958  [pdf, other

    cs.CV

    Skip and Skip: Segmenting Medical Images with Prompts

    Authors: Jiawei Chen, Dingkang Yang, Yuxuan Lei, Lihua Zhang

    Abstract: Most medical image lesion segmentation methods rely on hand-crafted accurate annotations of the original image for supervised learning. Recently, a series of weakly supervised or unsupervised methods have been proposed to reduce the dependence on pixel-level annotations. However, these methods are essentially based on pixel-level annotation, ignoring the image-level diagnostic results of the curre… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Work in progress

  41. arXiv:2406.14763  [pdf, other

    cs.CL cs.AI

    A Learn-Then-Reason Model Towards Generalization in Knowledge Base Question Answering

    Authors: Lingxi Zhang, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen

    Abstract: Large-scale knowledge bases (KBs) like Freebase and Wikidata house millions of structured knowledge. Knowledge Base Question Answering (KBQA) provides a user-friendly way to access these valuable KBs via asking natural language questions. In order to improve the generalization capabilities of KBQA models, extensive research has embraced a retrieve-then-reason framework to retrieve relevant evidenc… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  42. arXiv:2406.14556  [pdf, other

    cs.RO cs.CV

    Asynchronous Large Language Model Enhanced Planner for Autonomous Driving

    Authors: Yuan Chen, Zi-han Ding, Ziqin Wang, Yan Wang, Lijun Zhang, Si Liu

    Abstract: Despite real-time planners exhibiting remarkable performance in autonomous driving, the growing exploration of Large Language Models (LLMs) has opened avenues for enhancing the interpretability and controllability of motion planning. Nevertheless, LLM-based planners continue to encounter significant challenges, including elevated resource consumption and extended inference times, which pose substa… ▽ More

    Submitted 21 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

  43. arXiv:2406.14319  [pdf, other

    cs.AI cs.CL

    LiveMind: Low-latency Large Language Models with Simultaneous Inference

    Authors: Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li

    Abstract: In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) inference which enables LLMs to perform inferences with incomplete prompts. By reallocating computational processes to prompt input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the v…

    Submitted 20 June, 2024; originally announced June 2024.

  44. Unifying Graph Convolution and Contrastive Learning in Collaborative Filtering

    Authors: Yihong Wu, Le Zhang, Fengran Mo, Tianyu Zhu, Weizhi Ma, Jian-Yun Nie

    Abstract: Graph-based models and contrastive learning have emerged as prominent methods in Collaborative Filtering (CF). While many existing models in CF incorporate these methods in their design, there seems to be a limited depth of analysis regarding the foundational principles behind them. This paper bridges graph convolution, a pivotal element of graph-based models, with contrastive learning through a t…

    Submitted 21 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: KDD 2024

  45. arXiv:2406.13919  [pdf, other]

    cs.AI

    SPL: A Socratic Playground for Learning Powered by Large Language Model

    Authors: Liang Zhang, Jionghao Lin, Ziyi Kuang, Sheng Xu, Mohammed Yeasin, Xiangen Hu

    Abstract: Dialogue-based Intelligent Tutoring Systems (ITSs) have significantly advanced adaptive and personalized learning by automating sophisticated human tutoring strategies within interactive dialogues. However, replicating the nuanced patterns of expert human communication remains a challenge in Natural Language Processing (NLP). Recent advancements in NLP, particularly Large Language Models (LLMs) su…

    Submitted 20 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

  46. arXiv:2406.13897  [pdf, other]

    cs.CV

    CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

    Authors: Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, Jingyi Yu

    Abstract: In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and efforts. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic…

    Submitted 30 May, 2024; originally announced June 2024.

    Comments: Project page: https://sites.google.com/view/clay-3dlm Video: https://youtu.be/YcKFp4U2Voo

  47. arXiv:2406.13660  [pdf, other]

    cs.CL cs.AI

    Towards Minimal Targeted Updates of Language Models with Targeted Negative Training

    Authors: Lily H. Zhang, Rajesh Ranganath, Arya Tafvizi

    Abstract: Generative models of language exhibit impressive capabilities but still place non-negligible probability mass over undesirable outputs. In this work, we address the task of updating a model to avoid unwanted outputs while minimally changing model behavior otherwise, a challenge we refer to as a minimal targeted update. We first formalize the notion of a minimal targeted update and propose a method…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Published in Transactions of Machine Learning Research

  48. arXiv:2406.13133  [pdf, other]

    cs.CL cs.LG q-bio.GN

    PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model

    Authors: Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang

    Abstract: Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 9 pages, 3 figures

  49. arXiv:2406.12742  [pdf, other]

    cs.CV cs.AI cs.CL

    Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

    Authors: Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales

    Abstract: The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: First three authors contributed equally. Dataset: https://huggingface.co/datasets/VLLMs/MIRB

  50. arXiv:2406.12641  [pdf, other]

    cs.CL

    DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence?

    Authors: Zhouhong Gu, Lin Zhang, Xiaoxuan Zhu, Jiangjie Chen, Wenhao Huang, Yikai Zhang, Shusen Wang, Zheyu Ye, Yan Gao, Hongwei Feng, Yanghua Xiao

    Abstract: Detecting evidence within the context is a key step in the process of reasoning task. Evaluating and enhancing the capabilities of LLMs in evidence detection will strengthen context-based reasoning performance. This paper proposes a benchmark called DetectBench for verifying the ability to detect and piece together implicit evidence within a long context. DetectBench contains 3,928 multiple-choice…

    Submitted 18 June, 2024; originally announced June 2024.