Skip to main content

Showing 1–50 of 552 results for author: Pan, J

  1. arXiv:2407.03227  [pdf, other

    cs.CL cs.AI cs.DB

    Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning

    Authors: Zhili Shen, Pavlos Vougiouklis, Chenxin Diao, Kaustubh Vyas, Yuanyi Ji, Jeff Z. Pan

    Abstract: We focus on Text-to-SQL semantic parsing from the perspective of Large Language Models. Motivated by challenges related to the size of commercial database schemata and the deployability of business intelligence solutions, we propose an approach that dynamically retrieves input database information and uses abstract syntax trees to select few-shot examples for in-context learning. Furthermore, we… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

  2. arXiv:2407.01009  [pdf, other

    cs.CL

    DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models

    Authors: Jiabao Pan, Yan Zhang, Chen Zhang, Zuozhu Liu, Hongwei Wang, Haizhou Li

    Abstract: Large language models (LLMs) have demonstrated emergent capabilities across diverse reasoning tasks via popular Chains-of-Thought (COT) prompting. However, such a simple and fast COT approach often encounters limitations in dealing with complicated problems, while a thorough method, which considers multiple reasoning pathways and verifies each step carefully, results in slower inference. This pape… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  3. arXiv:2407.00782  [pdf, other

    cs.CL

    Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

    Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li

    Abstract: Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step… ▽ More

    Submitted 2 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

  4. arXiv:2407.00769  [pdf, other

    quant-ph cs.DC

    Achieving Energetic Superiority Through System-Level Quantum Circuit Simulation

    Authors: Rong Fu, Zhongling Su, Han-Sen Zhong, Xiti Zhao, Jianyang Zhang, Feng Pan, Pan Zhang, Xianhe Zhao, Ming-Cheng Chen, Chao-Yang Lu, Jian-Wei Pan, Zhiling Pei, Xingcheng Zhang, Wanli Ouyang

    Abstract: Quantum Computational Superiority boasts rapid computation and high energy efficiency. Despite recent advances in classical algorithms aimed at refuting the milestone claim of Google's sycamore, challenges remain in generating uncorrelated samples of random quantum circuits. In this paper, we present a groundbreaking large-scale system technology that leverages optimization on global, node, and de… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  5. arXiv:2406.18193  [pdf, ps, other

    cs.CV cs.AI

    MammothModa: Multi-Modal Large Language Model

    Authors: Qi She, Junwen Pan, Xin Wan, Rui Zhang, Dawei Lu, Kai Huang

    Abstract: In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating Visual Capabilities while Maintaining Complex Language Understanding: In addition to the vision encoder, we incorporated the Visual Attention Experts into the LLM t… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Technical report

  6. Start from Zero: Triple Set Prediction for Automatic Knowledge Graph Completion

    Authors: Wen Zhang, Yajing Xu, Peng Ye, Zhiwei Huang, Zezhong Xu, Jiaoyan Chen, Jeff Z. Pan, Huajun Chen

    Abstract: Knowledge graph (KG) completion aims to find out missing triples in a KG. Some tasks, such as link prediction and instance completion, have been proposed for KG completion. They are triple-level tasks with some elements in a missing triple given to predict the missing element of the triple. However, knowing some elements of the missing triple in advance is not always a realistic setting. In this p… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Paper accepted by TKDE in 2024

  7. arXiv:2406.16872  [pdf, other

    eess.SP cs.AI

    Multi-channel Time Series Decomposition Network For Generalizable Sensor-Based Activity Recognition

    Authors: Jianguo Pan, Zhengxin Hu, Lingdun Zhang, Xia Cai

    Abstract: Sensor-based human activity recognition is important in daily scenarios such as smart healthcare and homes due to its non-intrusive privacy and low cost advantages, but the problem of out-of-domain generalization caused by differences in focusing individuals and operating environments can lead to significant accuracy degradation on cross-person behavior recognition due to the inconsistent distribu… ▽ More

    Submitted 28 March, 2024; originally announced June 2024.

  8. arXiv:2406.14282  [pdf, other

    cs.CL cs.AI

    Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs

    Authors: Junjie Wang, Mingyang Chen, Binbin Hu, Dan Yang, Ziqi Liu, Yue Shen, Peng Wei, Zhiqiang Zhang, Jinjie Gu, Jun Zhou, Jeff Z. Pan, Wen Zhang, Huajun Chen

    Abstract: Improving the performance of large language models (LLMs) in complex question-answering (QA) scenarios has always been a research focal point. Recent studies have attempted to enhance LLMs' performance by combining step-wise planning with external retrieval. While effective for advanced models like GPT-3.5, smaller LLMs face challenges in decomposing complex questions, necessitating supervised fin… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Work in progress

  9. arXiv:2406.14136  [pdf, other

    cs.RO

    One Fling to Goal: Environment-aware Dynamics for Goal-conditioned Fabric Flinging

    Authors: Linhan Yang, Lei Yang, Haoran Sun, Zeqing Zhang, Haibin He, Fang Wan, Chaoyang Song, Jia Pan

    Abstract: Fabric manipulation dynamically is commonly seen in manufacturing and domestic settings. While dynamically manipulating a fabric piece to reach a target state is highly efficient, this task presents considerable challenges due to the varying properties of different fabrics, complex dynamics when interacting with environments, and meeting required goal conditions. To address these challenges, we pr… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  10. arXiv:2406.13607  [pdf, other

    cs.CV

    Ultra-High-Definition Restoration: New Benchmarks and A Dual Interaction Prior-Driven Solution

    Authors: Liyan Wang, Cong Wang, Jinshan Pan, Weixiang Zhou, Xiaoran Sun, Wei Wang, Zhixun Su

    Abstract: Ultra-High-Definition (UHD) image restoration has acquired remarkable attention due to its practical demand. In this paper, we construct UHD snow and rain benchmarks, named UHD-Snow and UHD-Rain, to remedy the deficiency in this field. The UHD-Snow/UHD-Rain is established by simulating the physics process of rain/snow into consideration and each benchmark contains 3200 degraded/clear image pairs o… ▽ More

    Submitted 22 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

  11. arXiv:2406.11896  [pdf, other

    cs.LG

    DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

    Authors: Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar

    Abstract: Training corpuses for vision language models (VLMs) typically lack sufficient amounts of decision-centric data. This renders off-the-shelf VLMs sub-optimal for decision-making tasks such as in-the-wild device control through graphical user interfaces (GUIs). While training with static demonstrations has shown some promise, we show that such methods fall short for controlling real GUIs due to their… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 11 pages of main text, 28 pages in total

  12. arXiv:2406.10744   

    cs.CV

    Technique Report of CVPR 2024 PBDL Challenges

    Authors: Ying Fu, Yu Li, Shaodi You, Boxin Shi, Jose Alvarez, Coert van Gemeren, Linwei Chen, Yunhao Zou, Zichun Wang, Yichen Li, Yuze Han, Yingkai Zhang, Jianan Wang, Qinglin Liu, Wei Yu, Xiaoqian Lv, Jianing Li, Shengping Zhang, Xiangyang Ji, Yuanpei Chen, Yuhan Zhang, Weihang Peng, Liwen Zhang, Zhe Xu, Dingyong Gou , et al. (77 additional authors not shown)

    Abstract: The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, a… ▽ More

    Submitted 27 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

    Comments: The author list and contents need to be verified by all authors

  13. arXiv:2406.10093  [pdf, other

    cs.RO cs.LG

    BiKC: Keypose-Conditioned Consistency Policy for Bimanual Robotic Manipulation

    Authors: Dongjie Yu, Hang Xu, Yizhou Chen, Yi Ren, Jia Pan

    Abstract: Bimanual manipulation tasks typically involve multiple stages which require efficient interactions between two arms, posing step-wise and stage-wise challenges for imitation learning systems. Specifically, failure and delay of one step will broadcast through time, hinder success and efficiency of each sub-stage task, and thereby overall task performance. Although recent works have made strides in… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

  14. arXiv:2406.05972  [pdf, other

    cs.AI cs.CY cs.HC cs.LG econ.TH

    Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context

    Authors: Jingru Jia, Zehua Yuan, Junhao Pan, Paul McNamara, Deming Chen

    Abstract: When making decisions under uncertainty, individuals often deviate from rational behavior, which can be evaluated across three dimensions: risk preference, probability weighting, and loss aversion. Given the widespread use of large language models (LLMs) in decision-making processes, it is crucial to assess whether their behavior aligns with human norms and ethical expectations or exhibits potenti… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Jingru Jia and Zehua Yuan has equal contribution

  15. arXiv:2406.05130  [pdf, other

    cs.CL

    An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models

    Authors: Xiongtao Zhou, Jie He, Yuhua Ke, Guangyao Zhu, Víctor Gutiérrez-Basulto, Jeff Z. Pan

    Abstract: Multimodal large language models (MLLMs) fine-tuned with multimodal instruction datasets have demonstrated remarkable capabilities in multimodal tasks. However, fine-tuning all parameters of MLLMs has become challenging as they usually contain billions of parameters. To address this issue, we study parameter-efficient fine-tuning (PEFT) methods for MLLMs. We aim to identify effective methods for e… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: ACL finding 2024

  16. arXiv:2406.04321  [pdf, other

    cs.CV cs.LG cs.MM cs.SD

    VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

    Authors: Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

    Abstract: In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 190K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music t… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: The code and datasets will be available at https://github.com/ZeyueT/VidMuse/

  17. arXiv:2406.02430  [pdf, other

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu , et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  18. arXiv:2406.01028  [pdf, other

    cs.CV

    LLEMamba: Low-Light Enhancement via Relighting-Guided Mamba with Deep Unfolding Network

    Authors: Xuanqi Zhang, Haijin Zeng, Jinwang Pan, Qiangqiang Shen, Yongyong Chen

    Abstract: Transformer-based low-light enhancement methods have yielded promising performance by effectively capturing long-range dependencies in a global context. However, their elevated computational demand limits the scalability of multiple iterations in deep unfolding networks, and hence they have difficulty in flexibly balancing interpretability and distortion. To address this issue, we propose a novel… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 9pages, 7 figures

  19. arXiv:2406.00629  [pdf, other

    cs.CV

    Correlation Matching Transformation Transformers for UHD Image Restoration

    Authors: Cong Wang, Jinshan Pan, Wei Wang, Gang Fu, Siyuan Liang, Mengzhu Wang, Xiao-Ming Wu, Jun Liu

    Abstract: This paper proposes UHDformer, a general Transformer for Ultra-High-Definition (UHD) image restoration. UHDformer contains two learning spaces: (a) learning in high-resolution space and (b) learning in low-resolution space. The former learns multi-level high-resolution features and fuses low-high features and reconstructs the residual images, while the latter explores more representative features… ▽ More

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: AAAI-24; Source codes, datasets, visual results, and pre-trained models are: https://github.com/supersupercong/UHDformer

  20. arXiv:2406.00329  [pdf, other

    eess.IV cs.CV cs.LG

    Whole Heart 3D+T Representation Learning Through Sparse 2D Cardiac MR Images

    Authors: Yundi Zhang, Chen Chen, Suprosanna Shit, Sophie Starck, Daniel Rueckert, Jiazhen Pan

    Abstract: Cardiac Magnetic Resonance (CMR) imaging serves as the gold-standard for evaluating cardiac morphology and function. Typically, a multi-view CMR stack, covering short-axis (SA) and 2/3/4-chamber long-axis (LA) views, is acquired for a thorough cardiac assessment. However, efficiently streamlining the complex, high-dimensional 3D+T CMR data and distilling compact, coherent representation remains a… ▽ More

    Submitted 6 June, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

  21. arXiv:2406.00192  [pdf, other

    eess.IV cs.CV cs.LG

    Direct Cardiac Segmentation from Undersampled K-space Using Transformers

    Authors: Yundi Zhang, Nil Stolt-Ansó, Jiazhen Pan, Wenqi Huang, Kerstin Hammernik, Daniel Rueckert

    Abstract: The prevailing deep learning-based methods of predicting cardiac segmentation involve reconstructed magnetic resonance (MR) images. The heavy dependency of segmentation approaches on image quality significantly limits the acceleration rate in fast MR reconstruction. Moreover, the practice of treating reconstruction and segmentation as separate sequential processes leads to artifact generation and… ▽ More

    Submitted 31 May, 2024; originally announced June 2024.

  22. arXiv:2405.17493  [pdf, other

    cs.LG

    Overcoming Negative Transfer by Online Selection: Distant Domain Adaptation for Fault Diagnosis

    Authors: Ziyan Wang, Mohamed Ragab, Wenmian Yang, Min Wu, Sinno Jialin Pan, Jie Zhang, Zhenghua Chen

    Abstract: Unsupervised domain adaptation (UDA) has achieved remarkable success in fault diagnosis, bringing significant benefits to diverse industrial applications. While most UDA methods focus on cross-working condition scenarios where the source and target domains are notably similar, real-world applications often grapple with severe domain shifts. We coin the term `distant domain adaptation problem' to d… ▽ More

    Submitted 25 May, 2024; originally announced May 2024.

    Comments: 8 pages, 5 figures

  23. arXiv:2405.17158  [pdf, other

    cs.CV

    PatchScaler: An Efficient Patch-Independent Diffusion Model for Super-Resolution

    Authors: Yong Liu, Hang Dong, Jinshan Pan, Qingji Dong, Kai Chen, Rongxiang Zhang, Lean Fu, Fei Wang

    Abstract: Diffusion models significantly improve the quality of super-resolved images with their impressive content generation capabilities. However, the huge computational costs limit the applications of these methods.Recent efforts have explored reasonable inference acceleration to reduce the number of sampling steps, but the computational cost remains high as each step is performed on the entire image.Th… ▽ More

    Submitted 11 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  24. arXiv:2405.17156  [pdf, other

    astro-ph.IM astro-ph.SR cs.LG

    The Scaling Law in Stellar Light Curves

    Authors: Jia-Shu Pan, Yuan-Sen Ting, Yang Huang, Jie Yu, Ji-Feng Liu

    Abstract: Analyzing time series of fluxes from stars, known as stellar light curves, can reveal valuable information about stellar properties. However, most current methods rely on extracting summary statistics, and studies using deep learning have been limited to supervised approaches. In this research, we investigate the scaling law properties that emerge when learning from astronomical time series data u… ▽ More

    Submitted 17 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: 11 pages, 5 figures, ICML 2024 AI4Science workshop

  25. arXiv:2405.17074  [pdf, other

    cs.CV

    Towards Ultra-High-Definition Image Deraining: A Benchmark and An Efficient Method

    Authors: Hongming Chen, Xiang Chen, Chen Wu, Zhuoran Zheng, Jinshan Pan, Xianping Fu

    Abstract: Despite significant progress has been made in image deraining, existing approaches are mostly carried out on low-resolution images. The effectiveness of these methods on high-resolution images is still unknown, especially for ultra-high-definition (UHD) images, given the continuous advancement of imaging devices. In this paper, we focus on the task of UHD image deraining, and contribute the first… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

  26. arXiv:2405.17057  [pdf, other

    cs.CL cs.AI

    ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation

    Authors: Houxing Ren, Mingjie Zhan, Zhongyuan Wu, Aojun Zhou, Junting Pan, Hongsheng Li

    Abstract: Code generation plays a crucial role in various tasks, such as code auto-completion and mathematical reasoning. Previous work has proposed numerous methods to enhance code generation performance, including integrating feedback from the compiler. Inspired by this, we present ReflectionCoder, a novel approach that effectively leverages reflection sequences constructed by integrating compiler feedbac… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

  27. arXiv:2405.16874  [pdf, other

    cs.CV

    CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild

    Authors: Xingqun Qi, Hengyuan Zhang, Yatian Wang, Jiahao Pan, Chen Liu, Peng Li, Xiaowei Chi, Mengfei Li, Qixun Zhang, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo

    Abstract: Deriving co-speech 3D gestures has seen tremendous progress in virtual avatar animation. Yet, the existing methods often produce stiff and unreasonable gestures with unseen human speech inputs due to the limited 3D speech-gesture data. In this paper, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight is built upo… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: The dataset will be released as soon as possible

  28. arXiv:2405.15984  [pdf, other

    cs.CL cs.AI

    Evaluating the Adversarial Robustness of Retrieval-Based In-Context Learning for Large Language Models

    Authors: Simon Chi Lok Yu, Jie He, Pasquale Minervini, Jeff Z. Pan

    Abstract: With the emergence of large language models, such as LLaMA and OpenAI GPT-3, In-Context Learning (ICL) gained significant attention due to its effectiveness and efficiency. However, ICL is very sensitive to the choice, order, and verbaliser used to encode the demonstrations in the prompt. Retrieval-Augmented ICL methods try to address this problem by leveraging retrievers to extract semantically r… ▽ More

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 29 pages, 6 figures

  29. arXiv:2405.15627  [pdf, other

    physics.class-ph cs.CE

    The Scattering Matrix-Based Characteristic Mode for Structure amidst Arbitrary Background: Theory, Benchmark and Applications

    Authors: Chenbo Shi, Jin Pan, Xin Gu, Shichen Liang, Le Zuo

    Abstract: This paper presents a novel approach for computing substructure characteristic modes. This method leverages electromagnetic scattering matrices and spherical wave expansion to directly decompose electromagnetic fields. Unlike conventional methods that rely on the impedance matrix generated by the method of moments (MoM), our technique simplifies the problem into a small-scale ordinary eigenvalue p… ▽ More

    Submitted 24 May, 2024; originally announced May 2024.

  30. arXiv:2405.14343  [pdf, other

    cs.CV

    Efficient Visual State Space Model for Image Deblurring

    Authors: Lingshun Kong, Jiangxin Dong, Ming-Hsuan Yang, Jinshan Pan

    Abstract: Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration. ViTs typically yield superior results in image restoration compared to CNNs due to their ability to capture long-range dependencies and input-dependent characteristics. However, the computational complexity of Transformer-based models grows quadratically with the image reso… ▽ More

    Submitted 23 May, 2024; originally announced May 2024.

  31. arXiv:2405.13602  [pdf, other

    cs.AI cs.CL cs.LG

    COTET: Cross-view Optimal Transport for Knowledge Graph Entity Typing

    Authors: Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, Jeff Z. Pan

    Abstract: Knowledge graph entity typing (KGET) aims to infer missing entity type instances in knowledge graphs. Previous research has predominantly centered around leveraging contextual information associated with entities, which provides valuable clues for inference. However, they have long ignored the dual nature of information inherent in entities, encompassing both high-level coarse-grained cluster know… ▽ More

    Submitted 22 May, 2024; originally announced May 2024.

  32. arXiv:2405.12706  [pdf, other

    cs.IR

    Disentangled Representation with Cross Experts Covariance Loss for Multi-Domain Recommendation

    Authors: Zhutian Lin, Junwei Pan, Haibin Yu, Xi Xiao, Ximei Wang, Zhixiang Feng, Shifeng Wen, Shudong Huang, Lei Xiao, Jie Jiang

    Abstract: Multi-domain learning (MDL) has emerged as a prominent research area aimed at enhancing the quality of personalized services. The key challenge in MDL lies in striking a balance between learning commonalities across domains while preserving the distinct characteristics of each domain. However, this gives rise to a challenging dilemma. On one hand, a model needs to leverage domain-specific modules,… ▽ More

    Submitted 21 May, 2024; originally announced May 2024.

  33. arXiv:2405.10292  [pdf, other

    cs.AI cs.CL cs.CV cs.LG

    Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

    Authors: Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine

    Abstract: Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic… ▽ More

    Submitted 16 May, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

  34. arXiv:2405.10160  [pdf, other

    cs.CV cs.AI

    PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning

    Authors: Jiancheng Pan, Muyuan Ma, Qing Ma, Cong Bai, Shengyong Chen

    Abstract: Remote sensing image-text retrieval constitutes a foundational aspect of remote sensing interpretation tasks, facilitating the alignment of vision and language representations. This paper introduces a prior instruction representation (PIR) learning paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations. Based on PIR, a domain-adapted remote sensing… ▽ More

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: 15 pages, 9 figures

  35. arXiv:2405.07510  [pdf, other

    cs.LG

    PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator

    Authors: Hanshu Yan, Xingchao Liu, Jiachun Pan, Jun Hao Liew, Qiang Liu, Jiashi Feng

    Abstract: We present Piecewise Rectified Flow (PeRFlow), a flow-based method for accelerating diffusion models. PeRFlow divides the sampling process of generative flows into several time windows and straightens the trajectories in each interval via the reflow operation, thereby approaching piecewise linear flows. PeRFlow achieves superior performance in a few-step generation. Moreover, through dedicated par… ▽ More

    Submitted 29 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

  36. arXiv:2405.06890  [pdf, other

    cs.CL cs.AI

    TacoERE: Cluster-aware Compression for Event Relation Extraction

    Authors: Yong Guan, Xiaozhi Wang, Lei Hou, Juanzi Li, Jeff Pan, Jiaoyan Chen, Freddy Lecue

    Abstract: Event relation extraction (ERE) is a critical and fundamental challenge for natural language processing. Existing work mainly focuses on directly modeling the entire document, which cannot effectively handle long-range dependencies and information redundancy. To address these issues, we propose a cluster-aware compression method for improving event relation extraction (TacoERE), which explores a c… ▽ More

    Submitted 10 May, 2024; originally announced May 2024.

    Comments: Accepted to LREC-COLING 2024

  37. arXiv:2405.06524  [pdf, other

    cs.CL

    Prompting Large Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts

    Authors: Wenyu Huang, Guancheng Zhou, Mirella Lapata, Pavlos Vougiouklis, Sebastien Montella, Jeff Z. Pan

    Abstract: Although Large Language Models (LLMs) are effective in performing various NLP tasks, they still struggle to handle tasks that require extensive, real-world knowledge, especially when dealing with long-tail facts (facts related to long-tail entities). This limitation highlights the need to supplement LLMs with non-parametric knowledge. To address this issue, we analysed the effects of different typ… ▽ More

    Submitted 10 May, 2024; originally announced May 2024.

  38. arXiv:2405.05001  [pdf, other

    cs.CV

    HMANet: Hybrid Multi-Axis Aggregation Network for Image Super-Resolution

    Authors: Shu-Chuan Chu, Zhi-Chao Dou, Jeng-Shyang Pan, Shaowei Weng, Junbao Li

    Abstract: Transformer-based methods have demonstrated excellent performance on super-resolution visual tasks, surpassing conventional convolutional neural networks. However, existing work typically restricts self-attention computation to non-overlapping windows to save computational costs. This means that Transformer-based networks can only use input information from a limited spatial range. Therefore, a no… ▽ More

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: 12 pages, 10 figures, conference

  39. arXiv:2404.18081  [pdf, other

    cs.SD cs.AI cs.CL cs.LG cs.MM eess.AS

    ComposerX: Multi-Agent Symbolic Music Composition with LLMs

    Authors: Qixin Deng, Qikai Yang, Ruibin Yuan, Yipeng Huang, Yi Wang, Xubo Liu, Zeyue Tian, Jiahao Pan, Ge Zhang, Hanfeng Lin, Yizhi Li, Yinghao Ma, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenwu Wang, Guangyu Xia, Wei Xue, Yike Guo

    Abstract: Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and C… ▽ More

    Submitted 30 April, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

  40. arXiv:2404.17621  [pdf, other

    eess.IV cs.CV cs.LG

    Attention-aware non-rigid image registration for accelerated MR imaging

    Authors: Aya Ghoul, Jiazhen Pan, Andreas Lingg, Jens Kübler, Patrick Krumm, Kerstin Hammernik, Daniel Rueckert, Sergios Gatidis, Thomas Küstner

    Abstract: Accurate motion estimation at high acceleration factors enables rapid motion-compensated reconstruction in Magnetic Resonance Imaging (MRI) without compromising the diagnostic image quality. In this work, we introduce an attention-aware deep learning-based framework that can perform non-rigid pairwise registration for fully sampled and accelerated MRI. We extract local visual representations to bu… ▽ More

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: 14 pages, 7 figures

  41. arXiv:2404.17590  [pdf, other

    cs.IR cs.AI

    Leveraging Intra-modal and Inter-modal Interaction for Multi-Modal Entity Alignment

    Authors: Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, Jeff Z. Pan

    Abstract: Multi-modal entity alignment (MMEA) aims to identify equivalent entity pairs across different multi-modal knowledge graphs (MMKGs). Existing approaches focus on how to better encode and aggregate information from different modalities. However, it is not trivial to leverage multi-modal knowledge in entity alignment due to the modal heterogeneity. In this paper, we propose a Multi-Grained Interactio… ▽ More

    Submitted 19 April, 2024; originally announced April 2024.

  42. arXiv:2404.16484  [pdf, other

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu , et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  43. arXiv:2404.14700  [pdf, other

    eess.AS cs.AI cs.CL cs.LG cs.SD

    FlashSpeech: Efficient Zero-Shot Speech Synthesis

    Authors: Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue

    Abstract: Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large… ▽ More

    Submitted 24 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Efficient zero-shot speech synthesis

  44. Deep Pattern Network for Click-Through Rate Prediction

    Authors: Hengyu Zhang, Junwei Pan, Dapeng Liu, Jie Jiang, Xiu Li

    Abstract: Click-through rate (CTR) prediction tasks play a pivotal role in real-world applications, particularly in recommendation systems and online advertising. A significant research branch in this domain focuses on user behavior modeling. Current research predominantly centers on modeling co-occurrence relationships between the target item and items previously interacted with by users in their historica… ▽ More

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: 12 pages, 10 figures, accepted by SIGIR2024

  45. arXiv:2404.10343  [pdf, other

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang , et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such… ▽ More

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  46. arXiv:2404.09848  [pdf, other

    cs.AI cs.LG

    HyperMono: A Monotonicity-aware Approach to Hyper-Relational Knowledge Representation

    Authors: Zhiwei Hu, Víctor Gutiérrez-Basulto, Zhiliang Xiang, Ru Li, Jeff Z. Pan

    Abstract: In a hyper-relational knowledge graph (HKG), each fact is composed of a main triple associated with attribute-value qualifiers, which express additional factual knowledge. The hyper-relational knowledge graph completion (HKGC) task aims at inferring plausible missing links in a HKG. Most existing approaches to HKGC focus on enhancing the communication between qualifier pairs and main triples, whil… ▽ More

    Submitted 15 April, 2024; originally announced April 2024.

  47. Countering Mainstream Bias via End-to-End Adaptive Local Learning

    Authors: Jinhao Pan, Ziwei Zhu, Jianling Wang, Allen Lin, James Caverlee

    Abstract: Collaborative filtering (CF) based recommendations suffer from mainstream bias -- where mainstream users are favored over niche users, leading to poor recommendation quality for many long-tail users. In this paper, we identify two root causes of this mainstream bias: (i) discrepancy modeling, whereby CF algorithms focus on modeling mainstream users while neglecting niche users with unique preferen… ▽ More

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: ECIR 2024

    Journal ref: In European Conference on Information Retrieval 2024, vol 14612 (pp. 75-89)

  48. arXiv:2404.08857  [pdf, other

    cs.SD cs.AI eess.AS

    Voice Attribute Editing with Text Prompt

    Authors: Zhengyan Sheng, Yang Ai, Li-Juan Liu, Jia Pan, Zhen-Hua Ling

    Abstract: Despite recent advancements in speech generation with text prompt providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this t… ▽ More

    Submitted 12 April, 2024; originally announced April 2024.

  49. arXiv:2404.07546  [pdf, other

    cs.CL

    Decomposing Label Space, Format and Discrimination: Rethinking How LLMs Respond and Solve Tasks via In-Context Learning

    Authors: Quanyu Long, Yin Wu, Wenya Wang, Sinno Jialin Pan

    Abstract: In-context Learning (ICL) has emerged as a powerful capability alongside the development of scaled-up large language models (LLMs). By instructing LLMs using few-shot demonstrative examples, ICL enables them to perform a wide range of tasks without updating millions of parameters. However, the precise contributions of demonstrations towards improving end-task performance have not been thoroughly i… ▽ More

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: 36 pages, 8 figures

  50. arXiv:2404.06474  [pdf, other

    cs.AI

    Autonomous Evaluation and Refinement of Digital Agents

    Authors: Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr

    Abstract: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with… ▽ More

    Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: Code at https://github.com/Berkeley-NLP/Agent-Eval-Refine