Skip to main content

Showing 1–50 of 3,494 results for author: Su

  1. arXiv:2407.11882  [pdf, other

    cs.CR

    Enhancing Covert Communication in Relay Systems Using Multi-Antenna Technique

    Authors: He Zhu, Huihui Wu, Wei Su, Xiaohong Jiang

    Abstract: This paper exploits the multi-antenna technique to enhance the covert communication performance in a relay system, where a source S conducts covert communication with a destination D via a relay R, subjecting to the detections of transmissions in the two hops from a single-antenna warden W. To demonstrate the performance gain from adopting the multi-antenna technique, we first consider the scenari… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

  2. arXiv:2407.11683  [pdf, other

    cs.CV

    Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning

    Authors: Yunbin Tu, Liang Li, Li Su, Chenggang Yan, Qingming Huang

    Abstract: Change captioning aims to succinctly describe the semantic change between a pair of similar images, while being immune to distractors (illumination and viewpoint changes). Under these distractors, unchanged objects often appear pseudo changes about location and scale, and certain objects might overlap others, resulting in perturbational and discrimination-degraded features between two images. Howe… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  3. arXiv:2407.11637  [pdf, other

    cs.CV

    REMM:Rotation-Equivariant Framework for End-to-End Multimodal Image Matching

    Authors: Han Nie, Bin Luo, Jun Liu, Zhitao Fu, Weixing Liu, Xin Su

    Abstract: We present REMM, a rotation-equivariant framework for end-to-end multimodal image matching, which fully encodes rotational differences of descriptors in the whole matching pipeline. Previous learning-based methods mainly focus on extracting modal-invariant descriptors, while consistently ignoring the rotational invariance. In this paper, we demonstrate that our REMM is very useful for multimodal i… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: 13 pages, 13 figures

  4. arXiv:2407.11537  [pdf, other

    cs.CV cs.AI

    AEMIM: Adversarial Examples Meet Masked Image Modeling

    Authors: Wenzhao Xiang, Chang Liu, Hang Su, Hongyang Yu

    Abstract: Masked image modeling (MIM) has gained significant traction for its remarkable prowess in representation learning. As an alternative to the traditional approach, the reconstruction from corrupted images has recently emerged as a promising pretext task. However, the regular corrupted images are generated using generic generators, often lacking relevance to the specific reconstruction task involved… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Under review of International Journal of Computer Vision (IJCV)

  5. arXiv:2407.11536  [pdf, other

    cs.CL cs.AI

    Fine-Tuning Medical Language Models for Enhanced Long-Contextual Understanding and Domain Expertise

    Authors: Qimin Yang, Rongsheng Wang, Jiexin Chen, Runqi Su, Tao Tan

    Abstract: Large Language Models (LLMs) have been widely applied in various professional fields. By fine-tuning the models using domain specific question and answer datasets, the professional domain knowledge and Q\&A abilities of these models have significantly improved, for example, medical professional LLMs that use fine-tuning of doctor-patient Q\&A data exhibit extraordinary disease diagnostic abilities… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: 5 pages, 1 figure. Accepted by the Workshop on Long-Context Foundation Models (LCFM) at ICML 2024

  6. arXiv:2407.11449  [pdf, other

    cs.CV cs.AI

    Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights

    Authors: Shunqi Mao, Chaoyi Zhang, Hang Su, Hwanjun Song, Igor Shalyminov, Weidong Cai

    Abstract: Contextualized Image Captioning (CIC) evolves traditional image captioning into a more complex domain, necessitating the ability for multimodal reasoning. It aims to generate image captions given specific contextual information. This paper further introduces a novel domain of Controllable Contextualized Image Captioning (Ctrl-CIC). Unlike CIC, which solely relies on broad context, Ctrl-CIC accentu… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  7. arXiv:2407.11180  [pdf, other

    cs.LG

    Transformer-based Drum-level Prediction in a Boiler Plant with Delayed Relations among Multivariates

    Authors: Gang Su, Sun Yang, Zhishuai Li

    Abstract: The steam drum water level is a critical parameter that directly impacts the safety and efficiency of power plant operations. However, predicting the drum water level in boilers is challenging due to complex non-linear process dynamics originating from long-time delays and interrelations, as well as measurement noise. This paper investigates the application of Transformer-based models for predicti… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: under review of IEEE PES ISGT Europe 2024 conference

  8. arXiv:2407.10655  [pdf, other

    cs.CV

    OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

    Authors: Yu Wang, Xiangbo Su, Qiang Chen, Xinyu Zhang, Teng Xi, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang

    Abstract: Open-vocabulary object detection focusing on detecting novel categories guided by natural language. In this report, we propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency. Building upon OVLW-DETR, we provide an end-to-end training recipe that transferring knowledge from vision-language mode… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 4 pages

  9. arXiv:2407.10285  [pdf, other

    cs.CV

    Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

    Authors: Qinyu Yang, Haoxin Chen, Yong Zhang, Menghan Xia, Xiaodong Cun, Zhixun Su, Ying Shan

    Abstract: In order to improve the quality of synthesized videos, currently, one predominant method involves retraining an expert diffusion model and then implementing a noising-denoising process for refinement. Despite the significant training costs, maintaining consistency of content between the original and enhanced videos remains a major challenge. To tackle this challenge, we propose a novel formulation… ▽ More

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: ECCV 2024, Project Page: https://yangqy1110.github.io/NC-SDEdit/, Code Repo: https://github.com/yangqy1110/NC-SDEdit/

    ACM Class: I.2; I.4.3

  10. arXiv:2407.09816  [pdf, other

    cs.CL

    MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts

    Authors: Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu

    Abstract: Scaling model capacity enhances its capabilities but significantly increases computation. Mixture-of-Experts models (MoEs) address this by allowing model capacity to scale without substantially increasing training or inference costs. Despite their promising results, MoE models encounter several challenges. Primarily, the dispersion of training tokens across multiple experts can lead to underfittin… ▽ More

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: Work in progress

  11. arXiv:2407.09552  [pdf

    cs.CV cs.GR

    Optimized 3D Point Labeling with Leaders Using the Beams Displacement Method

    Authors: Zhiwei Wei, Nai Yang, Wenjia Xu, Su Ding

    Abstract: In three-dimensional geographical scenes, adding labels with leader lines to point features can significantly improve their visibility. Leadered labels have a large degree of freedom in position con-figuration, but existing methods are mostly based on limited position candidate models, which not only fail to effectively utilize the map space but also make it difficult to consider the relative rela… ▽ More

    Submitted 27 June, 2024; originally announced July 2024.

    Comments: 12 pages, in Chinese language, 10 figures, an accepted version of ChinaVis2024

  12. arXiv:2407.09530  [pdf

    cs.CV cs.AI cs.RO

    Optimization of Autonomous Driving Image Detection Based on RFAConv and Triplet Attention

    Authors: Zhipeng Ling, Qi Xin, Yiyu Lin, Guangze Su, Zuwei Shui

    Abstract: YOLOv8 plays a crucial role in the realm of autonomous driving, owing to its high-speed target detection, precise identification and positioning, and versatile compatibility across multiple platforms. By processing video streams or images in real-time, YOLOv8 rapidly and accurately identifies obstacles such as vehicles and pedestrians on roadways, offering essential visual data for autonomous driv… ▽ More

    Submitted 25 June, 2024; originally announced July 2024.

    Comments: 13 pages

  13. arXiv:2407.09417  [pdf, other

    cs.CL cs.IR

    Mitigating Entity-Level Hallucination in Large Language Models

    Authors: Weihang Su, Yichen Tang, Qingyao Ai, Changyue Wang, Zhijing Wu, Yiqun Liu

    Abstract: The emergence of Large Language Models (LLMs) has revolutionized how users access information, shifting from traditional search engines to direct question-and-answer interactions with LLMs. However, the widespread adoption of LLMs has revealed a significant challenge known as hallucination, wherein LLMs generate coherent yet factually inaccurate responses. This hallucination phenomenon has led to… ▽ More

    Submitted 12 July, 2024; originally announced July 2024.

  14. arXiv:2407.09024  [pdf, other

    cs.LG

    Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control

    Authors: Huayu Chen, Kaiwen Zheng, Hang Su, Jun Zhu

    Abstract: Drawing upon recent advances in language model alignment, we formulate offline Reinforcement Learning as a two-stage optimization problem: First pretraining expressive generative policies on reward-free behavior datasets, then fine-tuning these policies to align with task-specific annotations like Q-values. This strategy allows us to leverage abundant and diverse behavior data to enhance generaliz… ▽ More

    Submitted 12 July, 2024; originally announced July 2024.

  15. arXiv:2407.08961  [pdf

    eess.IV cs.CV

    Tissue-Contrastive Semi-Masked Autoencoders for Segmentation Pretraining on Chest CT

    Authors: Jie Zheng, Ru Wen, Haiqin Hu, Lina Wei, Kui Su, Wei Chen, Chen Liu, Jun Wang

    Abstract: Existing Masked Image Modeling (MIM) depends on a spatial patch-based masking-reconstruction strategy to perceive objects'features from unlabeled images, which may face two limitations when applied to chest CT: 1) inefficient feature learning due to complex anatomical details presented in CT images, and 2) suboptimal knowledge transfer owing to input disparity between upstream and downstream model… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

  16. arXiv:2407.08949  [pdf, other

    cs.CV

    One-Shot Pose-Driving Face Animation Platform

    Authors: He Feng, Donglin Di, Yongjia Ma, Wei Chen, Tonghua Su

    Abstract: The objective of face animation is to generate dynamic and expressive talking head videos from a single reference face, utilizing driving conditions derived from either video or audio inputs. Current approaches often require fine-tuning for specific identities and frequently fail to produce expressive videos due to the limited effectiveness of Wav2Pose modules. To facilitate the generation of one-… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

  17. arXiv:2407.08762  [pdf, ps, other

    cs.SI cs.LG

    Commute-Time-Optimised Graphs for GNNs

    Authors: Igor Sterner, Shiye Su, Petar Veličković

    Abstract: We explore graph rewiring methods that optimise commute time. Recent graph rewiring approaches facilitate long-range interactions in sparse graphs, making such rewirings commute-time-optimal $\textit{on average}$. However, when an expert prior exists on which node pairs should or should not interact, a superior rewiring would favour short commute times between these privileged node pairs. We const… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

  18. arXiv:2407.08558  [pdf, other

    cs.AI

    ST-Mamba: Spatial-Temporal Mamba for Traffic Flow Estimation Recovery using Limited Data

    Authors: Doncheng Yuan, Jianzhe Xue, Jinshan Su, Wenchao Xu, Haibo Zhou

    Abstract: Traffic flow estimation (TFE) is crucial for urban intelligent traffic systems. While traditional on-road detectors are hindered by limited coverage and high costs, cloud computing and data mining of vehicular network data, such as driving speeds and GPS coordinates, present a promising and cost-effective alternative. Furthermore, minimizing data collection can significantly reduce overhead. Howev… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: Accepted by 2024 IEEE/CIC International Conference on Communications in China (ICCC)

  19. arXiv:2407.08268  [pdf, other

    cs.CV

    Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

    Authors: Tong Shao, Zhuotao Tian, Hang Zhao, Jingyong Su

    Abstract: CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature cor… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: ECCV24 accepted

  20. arXiv:2407.08234  [pdf, other

    cs.RO eess.SY

    Model Predictive Control For Mobile Manipulators Based On Neural Dynamics(Extended version)

    Authors: Tao Su, Shiqi Zheng

    Abstract: This article focuses on the trajectory tracking problem of mobile manipulators (MMs). Firstly, we construct a position and orientation model predictive tracking control (POMPTC) scheme for mobile manipulators. The proposed POMPTC scheme can simultaneously minimize the tracking error, joint velocity, and joint acceleration. Moreover, it can achieve synchronous control for the position and orientati… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: This article consists of 13 pages, including the text and the proof process

  21. arXiv:2407.08206  [pdf

    cs.CL

    System Report for CCL24-Eval Task 7: Multi-Error Modeling and Fluency-Targeted Pre-training for Chinese Essay Evaluation

    Authors: Jingshen Zhang, Xiangyu Yang, Xinkai Su, Xinglu Chen, Tianyou Huang, Xinying Qiu

    Abstract: This system report presents our approaches and results for the Chinese Essay Fluency Evaluation (CEFE) task at CCL-2024. For Track 1, we optimized predictions for challenging fine-grained error types using binary classification models and trained coarse-grained models on the Chinese Learner 4W corpus. In Track 2, we enhanced performance by constructing a pseudo-dataset with multiple error types pe… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

  22. arXiv:2407.08150  [pdf, other

    cs.CV

    Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding

    Authors: Minghui Wu, Chenxu Zhao, Anyang Su, Donglin Di, Tianyu Fu, Da An, Min He, Ya Gao, Meng Ma, Kun Yan, Ping Wang

    Abstract: Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across different ages, experiences, and genders. There is currently a lack of research in this area, and most existing benchmarks suffer from several drawbacks: 1) a limited number of modalities and answers with restrictive length; 2) the content and scenarios within… ▽ More

    Submitted 16 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted by ACM MULTIMEDIA 2024

  23. arXiv:2407.07930  [pdf

    q-bio.BM cs.LG

    Token-Mol 1.0: Tokenized drug design with large language model

    Authors: Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou

    Abstract: Significant interests have recently risen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

  24. arXiv:2407.07433  [pdf, other

    cs.CV cs.AI

    Controllable Navigation Instruction Generation with Chain of Thought Prompting

    Authors: Xianghao Kong, Jinyu Chen, Wenguan Wang, Hang Su, Xiaolin Hu, Yi Yang, Si Liu

    Abstract: Instruction generation is a vital and multidisciplinary research area with broad applications. Existing instruction generation models are limited to generating instructions in a single style from a particular dataset, and the style and content of generated instructions cannot be controlled. Moreover, most existing instruction generation methods also disregard the spatial modeling of the navigation… ▽ More

    Submitted 16 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  25. arXiv:2407.07365  [pdf, other

    cs.CV

    High-Resolution Cloud Detection Network

    Authors: Jingsheng Li, Tianxiang Xue, Jiayi Zhao, Jingmin Ge, Yufang Min, Wei Su, Kun Zhan

    Abstract: The complexity of clouds, particularly in terms of texture detail at high resolutions, has not been well explored by most existing cloud detection networks. This paper introduces the High-Resolution Cloud Detection Network (HR-cloud-Net), which utilizes a hierarchical high-resolution integration approach. HR-cloud-Net integrates a high-resolution representation module, layer-wise cascaded feature… ▽ More

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Journal of Electronic Imaging

  26. arXiv:2407.07089  [pdf, other

    cs.LG

    Fine-Tuning Linear Layers Only Is a Simple yet Effective Way for Task Arithmetic

    Authors: Ruochen Jin, Bojian Hou, Jiancong Xiao, Weijie Su, Li Shen

    Abstract: Task arithmetic has recently emerged as a cost-effective and scalable approach to edit pre-trained models directly in weight space, by adding the fine-tuned weights of different tasks. The performance has been further improved by a linear property which is illustrated by weight disentanglement. Yet, conventional linearization methods (e.g., NTK linearization) not only double the time and training… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

  27. arXiv:2407.07038  [pdf, other

    cs.CL

    Decoding Climate Disagreement: A Graph Neural Network-Based Approach to Understanding Social Media Dynamics

    Authors: Ruiran Su, Janet B. Pierrehumbert

    Abstract: This work introduces the ClimateSent-GAT Model, an innovative method that integrates Graph Attention Networks (GATs) with techniques from natural language processing to accurately identify and predict disagreements within Reddit comment-reply pairs. Our model classifies disagreements into three categories: agree, disagree, and neutral. Leveraging the inherent graph structure of Reddit comment-repl… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

  28. arXiv:2407.07024  [pdf, other

    cs.CV cs.AI

    Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

    Authors: Jeongseok Hyun, Su Ho Han, Hyolim Kang, Joon-Young Lee, Seon Joo Kim

    Abstract: The vocabulary size in temporal action localization (TAL) is constrained by the scarcity of large-scale annotated datasets. To address this, recent works incorporate powerful pre-trained vision-language models (VLMs), such as CLIP, to perform open-vocabulary TAL (OV-TAL). However, unlike VLMs trained on extensive image/video-text pairs, existing OV-TAL methods still rely on small, fully labeled TA… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

  29. arXiv:2407.06329  [pdf, other

    cs.LG cs.AI

    Solving Multi-Model MDPs by Coordinate Ascent and Dynamic Programming

    Authors: Xihong Su, Marek Petrik

    Abstract: Multi-model Markov decision process (MMDP) is a promising framework for computing policies that are robust to parameter uncertainty in MDPs. MMDPs aim to find a policy that maximizes the expected return over a distribution of MDP models. Because MMDPs are NP-hard to solve, most methods resort to approximations. In this paper, we derive the policy gradient of MMDPs and propose CADP, which combines… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted at UAI 2023

  30. arXiv:2407.06135  [pdf, other

    cs.CL cs.AI cs.CV

    ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

    Authors: Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu

    Abstract: Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation. To mi… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  31. arXiv:2407.05993  [pdf, other

    cs.CV

    Self-Prior Guided Mamba-UNet Networks for Medical Image Super-Resolution

    Authors: Zexin Ji, Beiji Zou, Xiaoyan Kui, Pierre Vera, Su Ruan

    Abstract: In this paper, we propose a self-prior guided Mamba-UNet network (SMamba-UNet) for medical image super-resolution. Existing methods are primarily based on convolutional neural networks (CNNs) or Transformers. CNNs-based methods fail to capture long-range dependencies, while Transformer-based approaches face heavy calculation challenges due to their quadratic computational complexity. Recently, Sta… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  32. arXiv:2407.05969  [pdf, other

    cs.CV

    Deform-Mamba Network for MRI Super-Resolution

    Authors: Zexin Ji, Beiji Zou, Xiaoyan Kui, Pierre Vera, Su Ruan

    Abstract: In this paper, we propose a new architecture, called Deform-Mamba, for MR image super-resolution. Unlike conventional CNN or Transformer-based super-resolution approaches which encounter challenges related to the local respective field or heavy computational cost, our approach aims to effectively explore the local and global information of images. Specifically, we develop a Deform-Mamba encoder wh… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  33. arXiv:2407.05963  [pdf, ps, other

    cs.SE cs.AI cs.NI cs.SI

    6GSoft: Software for Edge-to-Cloud Continuum

    Authors: Muhammad Azeem Akbar, Matteo Esposito, Sami Hyrynsalmi, Karthikeyan Dinesh Kumar, Valentina Lenarduzzi, Xiaozhou Li, Ali Mehraj, Tommi Mikkonen, Sergio Moreschini, Niko Mäkitalo, Markku Oivo, Anna-Sofia Paavonen, Risha Parveen, Kari Smolander, Ruoyu Su, Kari Systä, Davide Taibi, Nan Yang, Zheying Zhang, Muhammad Zohaib

    Abstract: In the era of 6G, developing and managing software requires cutting-edge software engineering (SE) theories and practices tailored for such complexity across a vast number of connected edge devices. Our project aims to lead the development of sustainable methods and energy-efficient orchestration models specifically for edge environments, enhancing architectural support driven by AI for contempora… ▽ More

    Submitted 9 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

  34. arXiv:2407.05638  [pdf, other

    cs.CV

    HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion

    Authors: Junhao Su, Chenghao He, Feiyu Zhu, Xiaojie Xu, Dongzhi Guan, Chenyang Si

    Abstract: Traditional deep learning relies on end-to-end backpropagation for training, but it suffers from drawbacks such as high memory consumption and not aligning with biological neural networks. Recent advancements have introduced locally supervised learning, which divides networks into modules with isolated gradients and trains them locally. However, this approach can lead to performance lag due to lim… ▽ More

    Submitted 8 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  35. arXiv:2407.05623  [pdf, other

    cs.CV

    Momentum Auxiliary Network for Supervised Local Learning

    Authors: Junhao Su, Changpeng Cai, Feiyu Zhu, Chenghao He, Xiaojie Xu, Dongzhi Guan, Chenyang Si

    Abstract: Deep neural networks conventionally employ end-to-end backpropagation for their training process, which lacks biological credibility and triggers a locking dilemma during network parameter updates, leading to significant GPU memory use. Supervised local learning, which segments the network into multiple local blocks updated by independent auxiliary networks. However, these methods cannot replace e… ▽ More

    Submitted 9 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  36. arXiv:2407.05577  [pdf, other

    cs.CV

    Audio-driven High-resolution Seamless Talking Head Video Editing via StyleGAN

    Authors: Jiacheng Su, Kunhong Liu, Liyan Chen, Junfeng Yao, Qingsong Liu, Dongdong Lv

    Abstract: The existing methods for audio-driven talking head video editing have the limitations of poor visual effects. This paper tries to tackle this problem through editing talking face images seamless with different emotions based on two modules: (1) an audio-to-landmark module, consisting of the CrossReconstructed Emotion Disentanglement and an alignment network module. It bridges the gap between speec… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  37. arXiv:2407.05576  [pdf, other

    cs.CV

    ORMNet: Object-centric Relationship Modeling for Egocentric Hand-object Segmentation

    Authors: Yuejiao Su, Yi Wang, Lap-Pui Chau

    Abstract: Egocentric hand-object segmentation (EgoHOS) is a brand-new task aiming at segmenting the hands and interacting objects in the egocentric image. Although significant advancements have been achieved by current methods, establishing an end-to-end model with high accuracy remains an unresolved challenge. Moreover, existing methods lack explicit modeling of the relationships between hands and objects… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  38. arXiv:2407.05389  [pdf, other

    cs.CV cs.AI

    Image-Conditional Diffusion Transformer for Underwater Image Enhancement

    Authors: Xingyang Nie, Su Pan, Xiaoyu Zhai, Shifei Tao, Fengzhong Qu, Biao Wang, Huilin Ge, Guojie Xiao

    Abstract: Underwater image enhancement (UIE) has attracted much attention owing to its importance for underwater operation and marine engineering. Motivated by the recent advance in generative models, we propose a novel UIE method based on image-conditional diffusion transformer (ICDT). Our method takes the degraded underwater image as the conditional input and converts it into latent space where ICDT is ap… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  39. arXiv:2407.05248  [pdf, other

    cs.CV

    Self-Paced Sample Selection for Barely-Supervised Medical Image Segmentation

    Authors: Junming Su, Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane

    Abstract: The existing barely-supervised medical image segmentation (BSS) methods, adopting a registration-segmentation paradigm, aim to learn from data with very few annotations to mitigate the extreme label scarcity problem. However, this paradigm poses a challenge: pseudo-labels generated by image registration come with significant noise. To address this issue, we propose a self-paced sample selection fr… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

    Comments: Accepted to MICCAI 2024

  40. arXiv:2407.05229  [pdf, other

    cs.LG

    HiDe-PET: Continual Learning via Hierarchical Decomposition of Parameter-Efficient Tuning

    Authors: Liyuan Wang, Jingyi Xie, Xingxing Zhang, Hang Su, Jun Zhu

    Abstract: The deployment of pre-trained models (PTMs) has greatly advanced the field of continual learning (CL), enabling positive knowledge transfer and resilience to catastrophic forgetting. To sustain these advantages for sequentially arriving tasks, a promising direction involves keeping the pre-trained backbone frozen while employing parameter-efficient tuning (PET) techniques to instruct representatio… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

    Comments: This is a generalized version of our HiDe-Prompt (NeurIPS 2023, Spotlight)

  41. arXiv:2407.05138  [pdf, other

    cs.SE cs.AI

    Vortex under Ripplet: An Empirical Study of RAG-enabled Applications

    Authors: Yuchen Shao, Yuheng Huang, Jiawei Shen, Lei Ma, Ting Su, Chengcheng Wan

    Abstract: Large language models (LLMs) enhanced by retrieval-augmented generation (RAG) provide effective solutions in various application scenarios. However, developers face challenges in integrating RAG-enhanced LLMs into software systems, due to lack of interface specification, requirements from software context, and complicated system management. In this paper, we manually studied 100 open-source applic… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

  42. arXiv:2407.05098  [pdf, other

    cs.LG cs.AI

    FedTSA: A Cluster-based Two-Stage Aggregation Method for Model-heterogeneous Federated Learning

    Authors: Boyu Fan, Chenrui Wu, Xiang Su, Pan Hui

    Abstract: Despite extensive research into data heterogeneity in federated learning (FL), system heterogeneity remains a significant yet often overlooked challenge. Traditional FL approaches typically assume homogeneous hardware resources across FL clients, implying that clients can train a global model within a comparable time frame. However, in practical FL systems, clients often have heterogeneous resourc… ▽ More

    Submitted 15 July, 2024; v1 submitted 6 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 2024

  43. arXiv:2407.03884  [pdf, other

    cs.CL cs.AI

    Planning with Large Language Models for Conversational Agents

    Authors: Zhigen Li, Jianxiang Peng, Yanmeng Wang, Tianhao Shen, Minghui Zhang, Linxi Su, Shang Wu, Yihang Wu, Yuqian Wang, Ye Wang, Wei Hu, Jianfeng Li, Shaojun Wang, Jing Xiao, Deyi Xiong

    Abstract: Controllability and proactivity are crucial properties of autonomous conversational agents (CAs). Controllability requires the CAs to follow the standard operating procedures (SOPs), such as verifying identity before activating credit cards. Proactivity requires the CAs to guide the conversation towards the goal during user uncooperation, such as persuasive dialogue. Existing research cannot be un… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  44. arXiv:2407.03804  [pdf, other

    cs.LG cs.NI

    Multi-Time Scale Service Caching and Pricing in MEC Systems with Dynamic Program Popularity

    Authors: Yiming Chen, Xingyuan Hu, Bo Gu, Shimin Gong, Zhou Su

    Abstract: In mobile edge computing systems, base stations (BSs) equipped with edge servers can provide computing services to users to reduce their task execution time. However, there is always a conflict of interest between the BS and users. The BS prices the service programs based on user demand to maximize its own profit, while the users determine their offloading strategies based on the prices to minimiz… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  45. arXiv:2407.03724  [pdf, other

    cs.RO

    Flight Structure Optimization of Modular Reconfigurable UAVs

    Authors: Yao Su, Ziyuan Jiao, Zeyu Zhang, Jingwen Zhang, Hang Li, Meng Wang, Hangxin Liu

    Abstract: This paper presents a Genetic Algorithm (GA) designed to reconfigure a large group of modular Unmanned Aerial Vehicles (UAVs), each with different weights and inertia parameters, into an over-actuated flight structure with improved dynamic properties. Previous research efforts either utilized expert knowledge to design flight structures for a specific task or relied on enumeration-based algorithms… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  46. arXiv:2407.03695  [pdf, other

    cs.CV

    M^3:Manipulation Mask Manufacturer for Arbitrary-Scale Super-Resolution Mask

    Authors: Xinyu Yang, Xiaochen Ma, Xuekang Zhu, Bo Du, Lei Su, Bingkui Tong, Zeyu Lei, Jizhe Zhou

    Abstract: In the field of image manipulation localization (IML), the small quantity and poor quality of existing datasets have always been major issues. A dataset containing various types of manipulations will greatly help improve the accuracy of IML models. Images on the internet (such as those on Baidu Tieba's PS Bar) are manipulated using various techniques, and creating a dataset from these images will… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  47. arXiv:2407.02894  [pdf, other

    cs.CL cs.AI

    Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation

    Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Min Zhang, Jinsong Su

    Abstract: In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language. In this regard, conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. Thus, constructing end-to-end models has be… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted to ACL 2024 Findings

  48. arXiv:2407.01921  [pdf, other

    cs.CV

    GVDIFF: Grounded Text-to-Video Generation with Diffusion Models

    Authors: Huanzhang Dou, Ruixiang Li, Wei Su, Xi Li

    Abstract: In text-to-video (T2V) generation, significant attention has been directed toward its development, yet unifying discrete and continuous grounding conditions in T2V generation remains under-explored. This paper proposes a Grounded text-to-Video generation framework, termed GVDIFF. First, we inject the grounding condition into the self-attention through an uncertainty-based representation to explici… ▽ More

    Submitted 4 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

  49. arXiv:2407.01636  [pdf, other

    cs.CV

    Learning Frequency-Aware Dynamic Transformers for All-In-One Image Restoration

    Authors: Zenglin Shi, Tong Su, Pei Liu, Yunpeng Wu, Le Zhang, Meng Wang

    Abstract: This work aims to tackle the all-in-one image restoration task, which seeks to handle multiple types of degradation with a single model. The primary challenge is to extract degradation representations from the input degraded images and use them to guide the model's adaptation to specific degradation types. Recognizing that various degradations affect image content differently across frequency band… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 8 pages

  50. arXiv:2407.01312  [pdf, other

    cs.CV

    ToCoAD: Two-Stage Contrastive Learning for Industrial Anomaly Detection

    Authors: Yun Liang, Zhiguang Hu, Junjie Huang, Donglin Di, Anyang Su, Lei Fan

    Abstract: Current unsupervised anomaly detection approaches perform well on public datasets but struggle with specific anomaly types due to the domain gap between pre-trained feature extractors and target-specific domains. To tackle this issue, this paper presents a two-stage training strategy, called \textbf{ToCoAD}. In the first stage, a discriminative network is trained by using synthetic anomalies in a… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 11 pages, 7 figures