Showing 1–50 of 355 results for author: Xie, Z

  1. arXiv:2407.02869  [pdf, other]

    cs.SD eess.AS

    PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

    Authors: Zeyu Xie, Xuenan Xu, Zhizheng Wu, Mengyue Wu

    Abstract: Recently, audio generation tasks have attracted considerable research interests. Precise temporal controllability is essential to integrate audio generation with real applications. In this work, we propose a temporal controlled audio generation framework, PicoAudio. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmen…

    Submitted 3 July, 2024; originally announced July 2024.

    MSC Class: 68Txx ACM Class: I.2

  2. arXiv:2407.02857  [pdf, other]

    cs.SD eess.AS

    AudioTime: A Temporally-aligned Audio-text Benchmark Dataset

    Authors: Zeyu Xie, Xuenan Xu, Zhizheng Wu, Mengyue Wu

    Abstract: Recent advancements in audio generation have enabled the creation of high-fidelity audio clips from free-form textual descriptions. However, temporal relationships, a critical feature for audio content, are currently underrepresented in mainstream models, resulting in an imprecise temporal controllability. Specifically, users cannot accurately control the timestamps of sound events using free-form…

    Submitted 3 July, 2024; originally announced July 2024.

    MSC Class: 68Txx ACM Class: I.2

  3. arXiv:2407.02386  [pdf, other]

    cs.CV

    OpenSlot: Mixed Open-set Recognition with Object-centric Learning

    Authors: Xu Yin, Fei Pan, Guoyuan An, Yuchi Huo, Zixuan Xie, Sung-Eui Yoon

    Abstract: Existing open-set recognition (OSR) studies typically assume that each image contains only one class label, and the unknown test set (negative) has a disjoint label space from the known test set (positive), a scenario termed full-label shift. This paper introduces the mixed OSR problem, where test images contain multiple class semantics, with known and unknown classes co-occurring in negatives, le…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: This study is under IEEE TMM review

  4. arXiv:2407.01511  [pdf, other]

    cs.AI

    CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

    Authors: Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li

    Abstract: The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the compl…

    Submitted 1 July, 2024; originally announced July 2024.

  5. arXiv:2407.00144  [pdf, other]

    cs.RO

    SCOPE: Stochastic Cartographic Occupancy Prediction Engine for Uncertainty-Aware Dynamic Navigation

    Authors: Zhanteng Xie, Philip Dames

    Abstract: This article presents a family of Stochastic Cartographic Occupancy Prediction Engines (SCOPEs) that enable mobile robots to predict the future states of complex dynamic environments. They do this by accounting for the motion of the robot itself, the motion of dynamic objects, and the geometry of static objects in the scene, and they generate a range of possible future states of the environment. T…

    Submitted 28 June, 2024; originally announced July 2024.

  6. arXiv:2406.19101  [pdf, other]

    cs.CV

    DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

    Authors: Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin

    Abstract: Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception, it also leads to longer sequences of visual t…

    Submitted 27 June, 2024; originally announced June 2024.

  7. arXiv:2406.17260  [pdf, other]

    cs.CL

    Mitigating Hallucination in Fictional Character Role-Play

    Authors: Nafis Sadeq, Zhouhang Xie, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

    Abstract: Role-playing has wide-ranging applications in customer support, embodied agents, computational social science, etc. The influence of parametric world knowledge of large language models (LLMs) often causes role-playing characters to act out of character and hallucinate about things outside the scope of their knowledge. In this work, we focus on the evaluation and mitigation of hallucination in fict…

    Submitted 24 June, 2024; originally announced June 2024.

  8. arXiv:2406.14979  [pdf, other]

    cs.CL

    Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation

    Authors: Yuanjie Lyu, Zihan Niu, Zheyong Xie, Chao Zhang, Tong Xu, Yang Wang, Enhong Chen

    Abstract: Despite the significant progress of large language models (LLMs) in various tasks, they often produce factual errors due to their limited internal knowledge. Retrieval-Augmented Generation (RAG), which enhances LLMs with external knowledge sources, offers a promising solution. However, these methods can be misled by irrelevant paragraphs in retrieved documents. Due to the inherent uncertainty in L…

    Submitted 21 June, 2024; originally announced June 2024.

  9. arXiv:2406.14928  [pdf, other]

    cs.AI cs.CL cs.HC cs.MA cs.SI

    Autonomous Agents for Collaborative Task under Information Asymmetry

    Authors: Wei Liu, Chenxi Wang, Yifei Wang, Zihao Xie, Rennai Qiu, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Chen Qian

    Abstract: Large Language Model Multi-Agent Systems (LLM-MAS) have achieved great progress in solving complex tasks. It performs communication among agents within the system to collaboratively solve tasks, under the premise of shared information. However, when agents' communication is leveraged to enhance human cooperation, a new challenge arises due to information asymmetry, since each agent can only access…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: 16 pages, 8 figures, 5 tables, Work in progress

  10. arXiv:2406.14393  [pdf, other]

    cs.LG cs.CL

    Jailbreaking as a Reward Misspecification Problem

    Authors: Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

    Abstract: The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and d…

    Submitted 20 June, 2024; originally announced June 2024.

  11. arXiv:2406.13618  [pdf, other]

    cs.CL

    In-Context Former: Lightning-fast Compressing Context for Large Language Model

    Authors: Xiangfeng Wang, Zaiyi Chen, Zheyong Xie, Tong Xu, Yongyi He, Enhong Chen

    Abstract: With the rising popularity of Transformer-based large language models (LLMs), reducing their high inference costs has become a significant research focus. One effective approach is to compress the long input contexts. Existing methods typically leverage the self-attention mechanism of the LLM itself for context compression. While these methods have achieved notable results, the compression process…

    Submitted 19 June, 2024; originally announced June 2024.

  12. arXiv:2406.13351  [pdf, other]

    cs.LG cs.AI cs.DC

    A Resource-Adaptive Approach for Federated Learning under Resource-Constrained Environments

    Authors: Ruirui Zhang, Xingze Wu, Yifei Zou, Zhenzhen Xie, Peng Li, Xiuzhen Cheng, Dongxiao Yu

    Abstract: The paper studies a fundamental federated learning (FL) problem involving multiple clients with heterogeneous constrained resources. Compared with the numerous training parameters, the computing and communication resources of clients are insufficient for fast local training and real-time knowledge sharing. Besides, training on clients with heterogeneous resources may result in the straggler proble…

    Submitted 19 June, 2024; originally announced June 2024.

  13. arXiv:2406.11931  [pdf, other]

    cs.SE cs.AI cs.LG

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Authors: DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, et al. (15 additional authors not shown)

    Abstract: We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathe…

    Submitted 17 June, 2024; originally announced June 2024.

  14. PIG: Prompt Images Guidance for Night-Time Scene Parsing

    Authors: Zhifeng Xie, Rui Qiu, Sen Wang, Xin Tan, Yuan Xie, Lizhuang Ma

    Abstract: Night-time scene parsing aims to extract pixel-level semantic information in night images, aiding downstream tasks in understanding scene object distribution. Due to limited labeled night image datasets, unsupervised domain adaptation (UDA) has become the predominant method for studying night scenes. UDA typically relies on paired day-night image pairs to guide adaptation, but this approach hamper…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: This paper is accepted by IEEE TIP. Code: https://github.com/qiurui4shu/PIG

  15. arXiv:2406.08979  [pdf, other]

    cs.CL cs.AI cs.MA cs.SE

    Multi-Agent Software Development through Cross-Team Collaboration

    Authors: Zhuoyun Du, Chen Qian, Wei Liu, Zihao Xie, Yifei Wang, Yufan Dang, Weize Chen, Cheng Yang

    Abstract: The latest breakthroughs in Large Language Models (LLMs), e.g., ChatDev, have catalyzed profound transformations, particularly through multi-agent collaboration for software development. LLM agents can collaborate in teams like humans, and follow the waterfall model to sequentially work on requirements analysis, development, review, testing, and other phases to perform autonomous software generatio…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Work in progress

  16. arXiv:2406.08052  [pdf, other]

    cs.SD eess.AS

    FakeSound: Deepfake General Audio Detection

    Authors: Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

    Abstract: With the advancement of audio generation, generative models can produce highly realistic audios. However, the proliferation of deepfake general audio can pose negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset n…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

    MSC Class: 68Txx ACM Class: I.2

  17. arXiv:2406.07645  [pdf, other]

    cs.CV cs.MM

    SSNVC: Single Stream Neural Video Compression with Implicit Temporal Information

    Authors: Feng Wang, Haihang Ruan, Zhihuang Xie, Ronggang Wang, Xiangyu Yue

    Abstract: Recently, Neural Video Compression (NVC) techniques have achieved remarkable performance, even surpassing the best traditional lossy video codec. However, most existing NVC methods heavily rely on transmitting Motion Vector (MV) to generate accurate contextual features, which has the following drawbacks. (1) Compressing and transmitting MV requires specialized MV encoder and decoder, which makes m…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by DCC 2024 as Poster. This is the full paper

  18. arXiv:2406.07155  [pdf, other]

    cs.AI cs.CL cs.MA cs.NI cs.SI

    Scaling Large-Language-Model-based Multi-Agent Collaboration

    Authors: Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun

    Abstract: Pioneering advancements in large language model-powered agents have underscored the design pattern of multi-agent collaboration, demonstrating that collective intelligence can surpass the capabilities of each individual. Inspired by the neural scaling law, which posits that increasing neurons leads to emergent abilities, this study investigates whether a similar principle applies to increasing age…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Work in progress; The code and data will be available at https://github.com/OpenBMB/ChatDev

  19. arXiv:2406.05392  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas

    Authors: Chengyuan Deng, Yiqun Duan, Xin Jin, Heng Chang, Yijun Tian, Han Liu, Henry Peng Zou, Yiqiao Jin, Yijia Xiao, Yichen Wang, Shenghao Wu, Zongxing Xie, Kuofeng Gao, Sihong He, Jun Zhuang, Lu Cheng, Haohan Wang

    Abstract: Large Language Models (LLMs) have achieved unparalleled success across diverse language modeling tasks in recent years. However, this progress has also intensified ethical concerns, impacting the deployment of LLMs in everyday contexts. This paper provides a comprehensive survey of ethical challenges associated with LLMs, from longstanding issues such as copyright infringement, systematic bias, an…

    Submitted 8 June, 2024; originally announced June 2024.

  20. arXiv:2406.03718  [pdf, other]

    cs.CR cs.AI cs.CL

    Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning

    Authors: Xiaohu Du, Ming Wen, Jiahao Zhu, Zifan Xie, Bin Ji, Huijun Liu, Xuanhua Shi, Hai Jin

    Abstract: Code Pre-trained Models (CodePTMs) based vulnerability detection have achieved promising results over recent years. However, these models struggle to generalize as they typically learn superficial mapping from source code to labels instead of understanding the root causes of code vulnerabilities, resulting in poor performance in real-world scenarios beyond the training instances. To tackle this ch…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 Findings

  21. arXiv:2406.02006  [pdf, other]

    math.OC cs.AI

    ODE-based Learning to Optimize

    Authors: Zhonglin Xie, Wotao Yin, Zaiwen Wen

    Abstract: Recent years have seen a growing interest in understanding acceleration methods through the lens of ordinary differential equations (ODEs). Despite the theoretical advancements, translating the rapid convergence observed in continuous-time models to discrete-time iterative methods poses significant challenges. In this paper, we present a comprehensive framework integrating the inertial systems wit…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 55 pages, 28 figures

  22. arXiv:2406.01152  [pdf, other]

    cs.RO

    Learning-based legged locomotion; state of the art and future perspectives

    Authors: Sehoon Ha, Joonho Lee, Michiel van de Panne, Zhaoming Xie, Wenhao Yu, Majid Khadiv

    Abstract: Legged locomotion holds the premise of universal mobility, a critical capability for many real-world robotic applications. Both model-based and learning-based approaches have advanced the field of legged locomotion in the past three decades. In recent years, however, a number of factors have dramatically accelerated progress in learning-based methods, including the rise of deep learning, rapid pro…

    Submitted 3 June, 2024; originally announced June 2024.

  23. arXiv:2406.01059  [pdf, other]

    cs.CV

    VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model

    Authors: Jinze Yang, Haoran Wang, Zining Zhu, Chenglong Liu, Meng Wymond Wu, Zeke Xie, Zhong Ji, Jungong Han, Mingming Sun

    Abstract: In this paper, we focus on resolving the problem of image outpainting, which aims to extrapolate the surrounding parts given the center contents of an image. Although recent works have achieved promising performance, the lack of versatility and customization hinders their practical applications in broader scenarios. Therefore, this work presents a novel image outpainting framework that is capable…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 15 pages

  24. arXiv:2406.00960  [pdf, other]

    cs.GR cs.RO

    PDP: Physics-Based Character Animation via Diffusion Policy

    Authors: Takara E. Truong, Michael Piseno, Zhaoming Xie, C. Karen Liu

    Abstract: Generating diverse and realistic human motion that can physically interact with an environment remains a challenging research area in character animation. Meanwhile, diffusion-based methods, as proposed by the robotics community, have demonstrated the ability to capture highly diverse and multi-modal skills. However, naively training a diffusion policy often results in unstable motions for high-fr…

    Submitted 2 June, 2024; originally announced June 2024.

  25. arXiv:2406.00626  [pdf, other]

    cs.MM cs.SD eess.AS

    Intelligent Text-Conditioned Music Generation

    Authors: Zhouyao Xie, Nikhil Yadala, Xinyi Chen, Jing Xi Liu

    Abstract: CLIP (Contrastive Language-Image Pre-Training) is a multimodal neural network trained on (text, image) pairs to predict the most relevant text caption given an image. It has been used extensively in image generation by connecting its output with a generative model such as VQGAN, with the most notable example being OpenAI's DALLE-2. In this project, we apply a similar approach to bridge the gap bet…

    Submitted 2 June, 2024; originally announced June 2024.

  26. arXiv:2406.00243  [pdf, ps, other]

    math.CO cs.IT math.AG

    There are no good infinite families of toric codes

    Authors: Jason P. Bell, Sean Monahan, Matthew Satriano, Karen Situ, Zheng Xie

    Abstract: Soprunov and Soprunova introduced the notion of a good infinite family of toric codes. We prove that such good families do not exist by proving a more general Szemerédi-type result: for all $c\in(0,1]$ and all positive integers $N$, subsets of density at least $c$ in $\{0,1,\dots,N-1\}^n$ contain hypercubes of arbitrarily large dimension as $n$ grows.

    Submitted 31 May, 2024; originally announced June 2024.

    Comments: 10 pages. Comments welcome

    MSC Class: 14G50; 14M25; 11B30; 94B05

  27. arXiv:2405.18975  [pdf, other]

    cs.LG

    Hierarchical Classification Auxiliary Network for Time Series Forecasting

    Authors: Yanru Sun, Zongxia Xie, Dongyue Chen, Emadeldeen Eldele, Qinghua Hu

    Abstract: Deep learning has significantly advanced time series forecasting through its powerful capacity to capture sequence relationships. However, training these models with the Mean Square Error (MSE) loss often results in over-smooth predictions, making it challenging to handle the complexity and learn high-entropy features from time series data with high variability and unpredictability. In this work,…

    Submitted 29 May, 2024; originally announced May 2024.

  28. arXiv:2405.18711  [pdf, other]

    cs.AI cs.CL

    Calibrating Reasoning in Language Models with Internal Consistency

    Authors: Zhihui Xie, Jizhou Guo, Tong Yu, Shuai Li

    Abstract: Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks, aided by techniques like chain-of-thought (CoT) prompting that elicits verbalized reasoning. However, LLMs often generate text with obvious mistakes and contradictions, raising doubts about their ability to robustly process and utilize generated rationales. In this work, we investigate CoT reasoning…

    Submitted 28 May, 2024; originally announced May 2024.

  29. arXiv:2405.17176  [pdf, other]

    cs.GR cs.AI

    DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models

    Authors: Yuqing Zhang, Yuan Liu, Zhiyu Xie, Lei Yang, Zhongyuan Liu, Mengzhou Yang, Runze Zhang, Qilong Kou, Cheng Lin, Wenping Wang, Xiaogang Jin

    Abstract: 2D diffusion model, which often contains unwanted baked-in shading effects and results in unrealistic rendering effects in the downstream applications. Generating Physically Based Rendering (PBR) materials instead of just RGB textures would be a promising solution. However, directly distilling the PBR material parameters from 2D diffusion models still suffers from incorrect material decomposition,…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted to SIGGRAPH 2024

  30. arXiv:2405.15638  [pdf, other]

    cs.CV cs.CL

    M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

    Authors: Hongyu Wang, Jiayu Xu, Senwei Xie, Ruiping Wang, Jialin Li, Zhaojie Xie, Bin Zhang, Chuyan Xiong, Xilin Chen

    Abstract: Multilingual multimodal reasoning is a core component in achieving human-level intelligence. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: Work in progress

  31. arXiv:2405.15239  [pdf, other]

    cs.CV

    Automating the Diagnosis of Human Vision Disorders by Cross-modal 3D Generation

    Authors: Li Zhang, Yuankun Yang, Ziyang Xie, Zhiyuan Yuan, Jianfeng Feng, Xiatian Zhu, Yu-Gang Jiang

    Abstract: Understanding the hidden mechanisms behind human's visual perception is a fundamental quest in neuroscience, underpins a wide variety of critical applications, e.g. clinical diagnosis. To that end, investigating into the neural responses of human mind activities, such as functional Magnetic Resonance Imaging (fMRI), has been a significant research vehicle. However, analyzing fMRI signals is challe…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 25 pages, 16 figures, project page: https://brain-3d.github.io/

  32. arXiv:2405.14880  [pdf, other]

    cs.CV cs.AI

    Dissecting Query-Key Interaction in Vision Transformers

    Authors: Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz

    Abstract: Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to use the Singular Value Decomposition to dissect the query-key interaction…

    Submitted 26 May, 2024; v1 submitted 4 April, 2024; originally announced May 2024.

  33. arXiv:2405.14029  [pdf, ps, other]

    cs.IT eess.SP

    Analog Beamforming Enabled Multicasting: Finite-Alphabet Inputs and Statistical CSI

    Authors: Yanjun Wu, Zhong Xie, Zhuochen Xie, Chongjun Ouyang, Xuwen Liang

    Abstract: The average multicast rate (AMR) is analyzed in a multicast channel utilizing analog beamforming with finite-alphabet inputs, considering statistical channel state information (CSI). New expressions for the AMR are derived for non-cooperative and cooperative multicasting scenarios. Asymptotic analyses are conducted in the high signal-to-noise ratio regime to derive the array gain and diversity ord…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 5 pages

  34. arXiv:2405.12119  [pdf, other]

    cs.IR cs.AI cs.CL

    Reindex-Then-Adapt: Improving Large Language Models for Conversational Recommendation

    Authors: Zhankui He, Zhouhang Xie, Harald Steck, Dawen Liang, Rahul Jha, Nathan Kallus, Julian McAuley

    Abstract: Large language models (LLMs) are revolutionizing conversational recommender systems by adeptly indexing item content, understanding complex conversational contexts, and generating relevant item titles. However, controlling the distribution of recommended items remains a challenge. This leads to suboptimal performance due to the failure to capture rapidly changing data distributions, such as item p…

    Submitted 20 May, 2024; originally announced May 2024.

  35. arXiv:2405.04434  [pdf, other]

    cs.CL cs.AI

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, et al. (132 additional authors not shown)

    Abstract: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference…

    Submitted 19 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

  36. arXiv:2405.04219  [pdf, other]

    cs.CL cs.AI cs.MA cs.SE

    Iterative Experience Refinement of Software-Developing Agents

    Authors: Chen Qian, Jiahao Li, Yufan Dang, Wei Liu, YiFei Wang, Zihao Xie, Weize Chen, Cheng Yang, Yingli Zhang, Zhiyuan Liu, Maosong Sun

    Abstract: Autonomous agents powered by large language models (LLMs) show significant potential for achieving high autonomy in various scenarios such as software development. Recent research has shown that LLM agents can leverage past experiences to reduce errors and enhance efficiency. However, the static experience paradigm, reliant on a fixed collection of past experiences acquired heuristically, lacks it…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Work in progress

  37. arXiv:2405.03194  [pdf, other]

    cs.CV

    CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario

    Authors: Zhizhao Duan, Hao Cheng, Duo Xu, Xi Wu, Xiangxie Zhang, Xi Ye, Zhen Xie

    Abstract: In the vast and dynamic landscape of urban settings, Traffic Safety Description and Analysis plays a pivotal role in applications ranging from insurance inspection to accident prevention. This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios. CityLLaVA enhances model comprehension and prediction accuracy through (1) employing…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: Accepted by AICITY2024 Workshop Track2 at CVPR2024

  38. arXiv:2405.03098  [pdf, other]

    cs.CL

    FairMonitor: A Dual-framework for Detecting Stereotypes and Biases in Large Language Models

    Authors: Yanhong Bai, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xingjiao Wu, Liang He

    Abstract: Detecting stereotypes and biases in Large Language Models (LLMs) is crucial for enhancing fairness and reducing adverse impacts on individuals or groups when these models are applied. Traditional methods, which rely on embedding spaces or are based on probability metrics, fall short in revealing the nuanced and implicit biases present in various contexts. To address this challenge, we propose the…

    Submitted 5 May, 2024; originally announced May 2024.

  39. arXiv:2405.01771  [pdf, other]

    cs.RO

    Towards Predicting Collective Performance in Multi-Robot Teams

    Authors: Pujie Xin, Zhanteng Xie, Philip Dames

    Abstract: The increased deployment of multi-robot systems (MRS) in various fields has led to the need for analysis of system-level performance. However, creating consistent metrics for MRS is challenging due to the wide range of system and environmental factors, such as team size and environment size. This paper presents a new analytical framework for MRS based on dimensionless variable analysis, a mathemat…

    Submitted 2 May, 2024; originally announced May 2024.

  40. arXiv:2404.15807  [pdf, other]

    cs.CL

    One Subgraph for All: Efficient Reasoning on Opening Subgraphs for Inductive Knowledge Graph Completion

    Authors: Zhiwen Xie, Yi Zhang, Guangyou Zhou, Jin Liu, Xinhui Tu, Jimmy Xiangji Huang

    Abstract: Knowledge Graph Completion (KGC) has garnered massive research interest recently, and most existing methods are designed following a transductive setting where all entities are observed during training. Despite the great progress on the transductive KGC, these methods struggle to conduct reasoning on emerging KGs involving unseen entities. Thus, inductive KGC, which aims to deduce missing links am…

    Submitted 24 April, 2024; originally announced April 2024.

  41. arXiv:2404.14951  [pdf, other]

    cs.CV

    Reconstructing the Image Stitching Pipeline: Integrating Fusion and Rectangling into a Unified Inpainting Model

    Authors: Ziqi Xie, Weidong Zhao, Xianhui Liu, Jian Zhao, Ning Jia

    Abstract: Deep learning-based image stitching pipelines are typically divided into three cascading stages: registration, fusion, and rectangling. Each stage requires its own network training and is tightly coupled to the others, leading to error propagation and posing significant challenges to parameter tuning and system stability. This paper proposes the Simple and Robust Stitcher (SRStitcher), which revol…

    Submitted 26 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

  42. arXiv:2404.12387  [pdf, other]

    cs.CL cs.CV

    Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

    Authors: Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, et al. (1 additional author not shown)

    Abstract: We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but al…

    Submitted 18 April, 2024; originally announced April 2024.

  43. arXiv:2404.11467  [pdf, other]

    cs.SE cs.CR

    A Large-scale Fine-grained Analysis of Packages in Open-Source Software Ecosystems

    Authors: Xiaoyan Zhou, Feiran Liang, Zhaojie Xie, Yang Lan, Wenjia Niu, Jiqiang Liu, Haining Wang, Qiang Li

    Abstract: Package managers such as NPM, Maven, and PyPI play a pivotal role in open-source software (OSS) ecosystems, streamlining the distribution and management of various freely available packages. The fine-grained details within software packages can unveil potential risks within existing OSS ecosystems, offering valuable insights for detecting malicious packages. In this study, we undertake a large-sca…

    Submitted 17 April, 2024; originally announced April 2024.

  44. Trustworthy Multimodal Fusion for Sentiment Analysis in Ordinal Sentiment Space

    Authors: Zhuyang Xie, Yan Yang, Jie Wang, Xiaorong Liu, Xiaofan Li

    Abstract: Multimodal video sentiment analysis aims to integrate multiple modal information to analyze the opinions and attitudes of speakers. Most previous work focuses on exploring the semantic interactions of intra- and inter-modality. However, these works ignore the reliability of multimodality, i.e., modalities tend to contain noise, semantic ambiguity, missing modalities, etc. In addition, previous mul…

    Submitted 13 April, 2024; originally announced April 2024.

    Comments: 14 pages, 9 figures, Accepted by IEEE Transactions on Circuits and Systems for Video Technology

  45. arXiv:2404.08648  [pdf]

    cs.NI cs.ET physics.optics

    Software-defined optical networking applications enabled by programmable integrated photonics

    Authors: Zhenyun Xie, David Sánchez-Jácome, Luis Torrijos-Morán, Daniel Pérez-López

    Abstract: Data center networks are experiencing unprecedented exponential growth, mostly driven by the continuous computing demands in machine learning and artificial intelligence algorithms. Within this realm, optical networking offers numerous advantages, including low latency, energy efficiency, and bandwidth transparency, positioning it as a compelling alternative to its electronic counterparts. In this…

    Submitted 4 March, 2024; originally announced April 2024.

  46. Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS

    Authors: Afzal Ahmad, Linfeng Du, Zhiyao Xie, Wei Zhang

    Abstract: One of the primary challenges impeding the progress of Neural Architecture Search (NAS) is its extensive reliance on exorbitant computational resources. NAS benchmarks aim to simulate runs of NAS experiments at zero cost, remediating the need for extensive compute. However, existing NAS benchmarks use synthetic datasets and model proxies that make simplified assumptions about the characteristics o…

    Submitted 18 June, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: Accepted at Design Automation Conference DAC'24

  47. arXiv:2403.20079  [pdf, other]

    cs.CV

    SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior

    Authors: Zhongrui Yu, Haoran Wang, Jinze Yang, Hanzhang Wang, Zeke Xie, Yunfeng Cai, Jiale Cao, Zhong Ji, Mingming Sun

    Abstract: Novel View Synthesis (NVS) for street scenes play a critical role in the autonomous driving simulation. The current mainstream technique to achieve it is neural rendering, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Although thrilling progress has been made, when handling street scenes, current methods struggle to maintain rendering quality at the viewpoint that deviate…

    Submitted 29 March, 2024; originally announced March 2024.

  48. arXiv:2403.19213  [pdf, other]

    cs.CV

    Learning Multiple Representations with Inconsistency-Guided Detail Regularization for Mask-Guided Matting

    Authors: Weihao Jiang, Zhaozhi Xie, Yuxiang Lu, Longjie Qi, Jingyong Cai, Hiroyuki Uchiyama, Bin Chen, Yue Ding, Hongtao Lu

    Abstract: Mask-guided matting networks have achieved significant improvements and have shown great potential in practical applications in recent years. However, simply learning matting representation from synthetic and lack-of-real-world-diversity matting data, these approaches tend to overfit low-level details in wrong regions, lack generalization to objects with complex structures and real-world scenes su…

    Submitted 28 March, 2024; originally announced March 2024.

  49. arXiv:2403.18453  [pdf, other]

    cs.AR

    Annotating Slack Directly on Your Verilog: Fine-Grained RTL Timing Evaluation for Early Optimization

    Authors: Wenji Fang, Shang Liu, Hongce Zhang, Zhiyao Xie

    Abstract: In digital IC design, compared with post-synthesis netlists or layouts, the early register-transfer level (RTL) stage offers greater optimization flexibility for both designers and EDA tools. However, timing information is typically unavailable at this early stage. Some recent machine learning (ML) solutions propose to predict the total negative slack (TNS) and worst negative slack (WNS) of an ent…

    Submitted 6 May, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

    Comments: Published as a conference paper at Design Automation Conference (DAC) 2024

  50. arXiv:2403.17615  [pdf, other]

    eess.IV cs.CV q-bio.QM

    Grad-CAMO: Learning Interpretable Single-Cell Morphological Profiles from 3D Cell Painting Images

    Authors: Vivek Gopalakrishnan, Jingzhe Ma, Zhiyong Xie

    Abstract: Despite their black-box nature, deep learning models are extensively used in image-based drug discovery to extract feature vectors from single cells in microscopy images. To better understand how these networks perform representation learning, we employ visual explainability techniques (e.g., Grad-CAM). Our analyses reveal several mechanisms by which supervised models cheat, exploiting biologicall…

    Submitted 26 March, 2024; originally announced March 2024.