Showing 1–50 of 166 results for author: Gan, Z

  1. arXiv:2407.02477

    cs.CV cs.CL

    Understanding Alignment in Multimodal LLMs: A Comprehensive Study

    Authors: Elmira Amirloo, Jean-Philippe Fauconnier, Christoph Roesmann, Christian Kerl, Rinu Boney, Yusu Qian, Zirui Wang, Afshin Dehghan, Yinfei Yang, Zhe Gan, Peter Grasch

    Abstract: Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by pro…

    Submitted 2 July, 2024; originally announced July 2024.

  2. arXiv:2407.01509

    cs.CV cs.CL

    MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

    Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan

    Abstract: We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results fro…

    Submitted 1 July, 2024; originally announced July 2024.

  3. arXiv:2406.17225

    eess.IV cs.CV

    Multimodal Cross-Task Interaction for Survival Analysis in Whole Slide Pathological Images

    Authors: Songhan Jiang, Zhengyu Gan, Linghan Cai, Yifeng Wang, Yongbing Zhang

    Abstract: Survival prediction, utilizing pathological images and genomic profiles, is increasingly important in cancer analysis and prognosis. Despite significant progress, precise survival analysis still faces two main challenges: (1) The massive pixels contained in whole slide images (WSIs) complicate the process of pathological images, making it difficult to generate an effective representation of the tu…

    Submitted 24 June, 2024; originally announced June 2024.

  4. arXiv:2406.13434

    cs.RO

    Tactile Aware Dynamic Obstacle Avoidance in Crowded Environment with Deep Reinforcement Learning

    Authors: Yung Chuen Ng, Qi Wen, Lim, Chun Ye Tan, Zhen Hao Gan, Meng Yee, Chuah

    Abstract: Mobile robots operating in crowded environments require the ability to navigate among humans and surrounding obstacles efficiently while adhering to safety standards and socially compliant mannerisms. This scale of the robot navigation problem may be classified as both a local path planning and trajectory optimization problem. This work presents an array of force sensors that act as a tactile laye…

    Submitted 19 June, 2024; originally announced June 2024.

  5. arXiv:2406.07314

    cs.LG

    Rethinking the impact of noisy labels in graph classification: A utility and privacy perspective

    Authors: De Li, Xianxian Li, Zeming Gan, Qiyu Li, Bin Qu, Jinyan Wang

    Abstract: Graph neural networks based on message-passing mechanisms have achieved advanced results in graph classification tasks. However, their generalization performance degrades when noisy labels are present in the training data. Most existing noisy labeling approaches focus on the visual domain or graph node classification tasks and analyze the impact of noisy labels only from a utility perspective. Unl…

    Submitted 11 June, 2024; originally announced June 2024.

  6. arXiv:2406.03262

    cs.CV

    ADer: A Comprehensive Benchmark for Multi-class Visual Anomaly Detection

    Authors: Jiangning Zhang, Haoyang He, Zhenye Gan, Qingdong He, Yuxuan Cai, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu

    Abstract: Visual anomaly detection aims to identify anomalous regions in images through unsupervised learning paradigms, with increasing application demand and value in fields such as industrial inspection and medical lesion detection. Despite significant progress in recent years, there is a lack of comprehensive benchmarks to adequately evaluate the performance of various mainstream methods across differen…

    Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

  7. arXiv:2405.20795

    cs.CV cs.AI

    InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding

    Authors: Huaxiang Zhang, Yaojia Mu, Guo-Niu Zhu, Zhongxue Gan

    Abstract: Accurate visual understanding is imperative for advancing autonomous systems and intelligent robots. Despite the powerful capabilities of vision-language models (VLMs) in processing complex visual scenes, precisely recognizing obscured or ambiguously presented visual elements remains challenging. To tackle such issues, this paper proposes InsightSee, a multi-agent framework to enhance VLMs' interp…

    Submitted 31 May, 2024; originally announced May 2024.

  8. arXiv:2405.17579

    cs.RO

    Harnessing Natural Oscillations for High-Speed, Efficient Asymmetrical Locomotion in Quadrupedal Robots

    Authors: Jing Cheng, Yasser G. Alqaham, Zhenyu Gan

    Abstract: This study explores the dynamics of asymmetrical bounding gaits in quadrupedal robots, focusing on the integration of torso pitching and hip motion to enhance speed and stability. Traditional control strategies often enforce a fixed posture, minimizing natural body movements to simplify the control problem. However, this approach may overlook the inherent dynamical advantages found in natural loco…

    Submitted 27 May, 2024; originally announced May 2024.

  9. arXiv:2405.13751

    cs.RO cs.AI

    GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games

    Authors: Aoran Mei, Jianhua Wang, Guo-Niu Zhu, Zhongxue Gan

    Abstract: With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task…

    Submitted 22 May, 2024; originally announced May 2024.

  10. arXiv:2405.10739

    cs.CV cs.AI

    Efficient Multimodal Large Language Models: A Survey

    Authors: Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma

    Abstract: In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, e…

    Submitted 17 May, 2024; originally announced May 2024.

  11. arXiv:2404.07973

    cs.CV

    Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

    Authors: Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang

    Abstract: While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it poses certain limitations: constrained by the pre-trained fixed visual encoder and failed to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Preprint. 14 pages, 4 figures

  12. arXiv:2404.06836

    cs.CV

    O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

    Authors: Muer Tie, Julong Wei, Zhengjun Wang, Ke Wu, Shansuai Yuan, Kaizhao Zhang, Jie Jia, Jieru Zhao, Zhongxue Gan, Wenchao Ding

    Abstract: Online construction of open-ended language scenes is crucial for robotic applications, where open-vocabulary interactive scene understanding is required. Recently, neural implicit representation has provided a promising direction for online interactive mapping. However, implementing open-vocabulary scene understanding capability into online neural implicit mapping still faces three challenges: lac…

    Submitted 10 April, 2024; originally announced April 2024.

  13. arXiv:2404.06564

    cs.CV

    MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection

    Authors: Haoyang He, Yuhu Bai, Jiangning Zhang, Qingdong He, Hongxu Chen, Zhenye Gan, Chengjie Wang, Xiangtai Li, Guanzhong Tian, Lei Xie

    Abstract: Recent advancements in anomaly detection have seen the efficacy of CNN- and transformer-based approaches. However, CNNs struggle with long-range dependencies, while transformers are burdened by quadratic computational complexity. Mamba-based models, with their superior long-range modeling and linear efficiency, have garnered substantial attention. This study pioneers the application of Mamba to mu…

    Submitted 14 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  14. arXiv:2404.05719

    cs.CV cs.CL cs.HC

    Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

    Authors: Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

    Abstract: Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given…

    Submitted 8 April, 2024; originally announced April 2024.

  15. arXiv:2403.20159

    cs.CV

    HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes

    Authors: Ke Wu, Kaizhao Zhang, Zhiwei Zhang, Shanshuai Yuan, Muer Tie, Julong Wei, Zijun Xu, Jieru Zhao, Zhongxue Gan, Wenchao Ding

    Abstract: Online dense mapping of urban scenes forms a fundamental cornerstone for scene understanding and navigation of autonomous vehicles. Recent advancements in mapping methods are mainly based on NeRF, whose rendering speed is too slow to meet online requirements. 3D Gaussian Splatting (3DGS), with its rendering speed hundreds of times faster than NeRF, holds greater potential in online dense mapping.…

    Submitted 29 March, 2024; originally announced March 2024.

  16. arXiv:2403.12580

    cs.CV

    Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection

    Authors: Chengjie Wang, Wenbing Zhu, Bin-Bin Gao, Zhenye Gan, Jianning Zhang, Zhihao Gu, Shuguang Qian, Mingang Chen, Lizhuang Ma

    Abstract: Industrial anomaly detection (IAD) has garnered significant attention and experienced rapid development. However, the recent development of IAD approach has encountered certain difficulties due to dataset limitations. On the one hand, most of the state-of-the-art methods have achieved saturation (over 99% in AUROC) on mainstream datasets such as MVTec, and the differences of methods cannot be well…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  17. arXiv:2403.12362

    cs.CV cs.LG

    DMAD: Dual Memory Bank for Real-World Anomaly Detection

    Authors: Jianlong Hu, Xu Chen, Zhenye Gan, Jinlong Peng, Shengchuan Zhang, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Liujuan Cao, Rongrong Ji

    Abstract: Training a unified model is considered to be more suitable for practical industrial anomaly detection scenarios due to its generalization ability and storage efficiency. However, this multi-class setting, which exclusively uses normal data, overlooks the few but important accessible annotated anomalies in the real world. To address the challenge of real-world anomaly detection, we propose a new fr…

    Submitted 18 March, 2024; originally announced March 2024.

  18. arXiv:2403.09611

    cs.CV cs.CL cs.LG

    MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    Authors: Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, et al. (7 additional authors not shown)

    Abstract: In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for la…

    Submitted 18 April, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  19. arXiv:2402.13220

    cs.CV cs.CL

    How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

    Authors: Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan

    Abstract: The remarkable advancements in Multimodal Large Language Models (MLLMs) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. To quantitatively assess this vulnerability, we present MAD-Bench, a carefully curated benchmark that contains 850 test samples divided into 6 categ…

    Submitted 20 February, 2024; originally announced February 2024.

  20. A Soft Continuum Robot with Self-Controllable Variable Curvature

    Authors: Xinran Wang, Qiujie Lu, Dongmyoung Lee, Zhongxue Gan, Nicolas Rojas

    Abstract: This paper introduces a new type of soft continuum robot, called SCoReS, which is capable of self-controlling continuously its curvature at the segment level; in contrast to previous designs which either require external forces or machine elements, or whose variable curvature capabilities are discrete -- depending on the number of locking mechanisms and segments. The ability to have a variable cur…

    Submitted 19 January, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

    Comments: Accepted for IEEE Robotics and Automation Letters in January 2024, Imperial's open access research REF 2029 open access policy

    Journal ref: IEEE Robotics and Automation Letters 2024

  21. arXiv:2401.00652

    cs.CV

    From Covert Hiding to Visual Editing: Robust Generative Video Steganography

    Authors: Xueying Mao, Xiaoxiao Hu, Wanli Peng, Zhenliang Gan, Qichao Ying, Zhenxing Qian, Sheng Li, Xinpeng Zhang

    Abstract: Traditional video steganography methods are based on modifying the covert space for embedding, whereas we propose an innovative approach that embeds secret message within semantic feature for steganography during the video editing process. Although existing traditional video steganography methods display a certain level of security and embedding capacity, they lack adequate robustness against comm…

    Submitted 31 December, 2023; originally announced January 2024.

    Comments: Under Review

  22. arXiv:2312.13503

    cs.CV cs.AI

    InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

    Authors: Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe, Lijuan Wang

    Abstract: In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round even with external knowledge related to the visual content. Different from existing datasets where the answer is compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to b…

    Submitted 20 December, 2023; originally announced December 2023.

  23. arXiv:2311.17647

    cs.CV cs.AI cs.CL

    Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

    Authors: Xiujun Li, Yujie Lu, Zhe Gan, Jianfeng Gao, William Yang Wang, Yejin Choi

    Abstract: Recent multimodal large language models (MLLMs) have shown promising instruction following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM), and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight be…

    Submitted 10 June, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

    Comments: Github: https://github.com/VIM-Bench/VIM_TOOL, Model and Data: https://huggingface.co/VIM-Bench

  24. arXiv:2311.16201

    cs.CV cs.AI cs.CL cs.LG

    Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

    Authors: Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev

    Abstract: Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation…

    Submitted 27 November, 2023; originally announced November 2023.

  25. arXiv:2310.13699

    cs.HC cs.ET

    Interaction in Metaverse: A Survey

    Authors: Hong Lin, Zirun Gan, Wensheng Gan, Zhenlian Qi, Yuehua Wang, Philip S. Yu

    Abstract: Human-computer interaction (HCI) emerged with the birth of the computer and has been upgraded through decades of development. Metaverse has attracted a lot of interest with its immersive experience, and HCI is the entrance to the Metaverse for people. It is predictable that HCI will determine the immersion of the Metaverse. However, the technologies of HCI in Metaverse are not mature enough. There…

    Submitted 27 September, 2023; originally announced October 2023.

    Comments: Preprint. 3 figures, 3 tables

  26. arXiv:2310.13398

    cs.CV

    OpenAnnotate3D: Open-Vocabulary Auto-Labeling System for Multi-modal 3D Data

    Authors: Yijie Zhou, Likun Cai, Xianhui Cheng, Zhongxue Gan, Xiangyang Xue, Wenchao Ding

    Abstract: In the era of big data and large models, automatic annotating functions for multi-modal data are of great significance for real-world AI-driven applications, such as autonomous driving and embodied AI. Unlike traditional closed-set annotation, open-vocabulary annotation is essential to achieve human-level cognition capability. However, there are few open-vocabulary auto-labeling systems for multi-…

    Submitted 20 October, 2023; originally announced October 2023.

    Comments: The source code will be released at https://github.com/Fudan-ProjectTitan/OpenAnnotate3D

  27. arXiv:2310.07704

    cs.CV cs.CL

    Ferret: Refer and Ground Anything Anywhere at Any Granularity

    Authors: Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang

    Abstract: We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to r…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: 30 pages, 10 figures. Code/Project Website: https://github.com/apple/ml-ferret

  28. arXiv:2310.07699

    cs.CV cs.AI cs.LG

    VeCLIP: Improving CLIP Training via Visual-enriched Captions

    Authors: Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, Meng Cao

    Abstract: Large-scale web-crawled datasets are fundamental for the success of pre-training vision-language models, such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges in achieving precise image-text alignment. Existing methods utilizing large language models (LLMs) for caption rewriting have shown promise on small, curated datasets like CC3M and CC12M.…

    Submitted 13 March, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

    Comments: CV/ML

  29. arXiv:2310.01382

    cs.CL cs.LG

    Compressing LLMs: The Truth is Rarely Pure and Never Simple

    Authors: Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang

    Abstract: Despite their remarkable achievements, modern Large Language Models (LLMs) face exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs that achieve 50 - 60% sparsity and reduce the bit width to 3 or 4 bits per weight, with negligible degradation of perplexity over the uncom…

    Submitted 16 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Accepted to ICLR 2024

  30. arXiv:2309.17102

    cs.CV

    Guiding Instruction-based Image Editing via Multimodal Large Language Models

    Authors: Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan

    Abstract: Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation…

    Submitted 5 February, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ICLR'24 (Spotlight); Project at https://mllm-ie.github.io; Code at https://github.com/tsujuifu/pytorch_mgie

  31. arXiv:2309.10020

    cs.CV cs.CL

    Multimodal Foundation Models: From Specialists to General-Purpose Assistants

    Authors: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao

    Abstract: This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: 119 pages, PDF file size 58MB; Tutorial website: https://vlp-tutorial.github.io/2023/

  32. arXiv:2308.07551

    cs.CV

    FLAME-based Multi-View 3D Face Reconstruction

    Authors: Wenzhuo Zheng, Junhao Zhao, Xiaohong Liu, Yongyang Pan, Zhenghao Gan, Haozhe Han, Ning Liu

    Abstract: At present, face 3D reconstruction has broad application prospects in various fields, but the research on it is still in the development stage. In this paper, we hope to achieve better face 3D reconstruction quality by combining multi-view training framework with face parametric model Flame, propose a multi-view training and testing model MFNet (Multi-view Flame Network). We build a self-supervise…

    Submitted 25 September, 2023; v1 submitted 14 August, 2023; originally announced August 2023.

  33. UniG-Encoder: A Universal Feature Encoder for Graph and Hypergraph Node Classification

    Authors: Minhao Zou, Zhongxue Gan, Yutong Wang, Junheng Zhang, Dongyan Sui, Chun Guan, Siyang Leng

    Abstract: Graph and hypergraph representation learning has attracted increasing attention from various research fields. Despite the decent performance and fruitful applications of Graph Neural Networks (GNNs), Hypergraph Neural Networks (HGNNs), and their well-designed variants, on some commonly used benchmark graphs and hypergraphs, they are outperformed by even a simple Multi-Layer Perceptron. This observ…

    Submitted 3 August, 2023; originally announced August 2023.

  34. arXiv:2308.01194

    cs.CV

    Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement Augmentation

    Authors: Siao Liu, Zhaoyu Chen, Yang Liu, Yuzheng Wang, Dingkang Yang, Zhile Zhao, Ziqing Zhou, Xie Yi, Wei Li, Wenqiang Zhang, Zhongxue Gan

    Abstract: Learning a policy with great generalization to unseen environments remains challenging but critical in visual reinforcement learning. Despite the success of augmentation combination in the supervised learning generalization, naively applying it to visual RL algorithms may damage the training efficiency, suffering from severe performance degradation. In this paper, we first conduct qualitative analy…

    Submitted 2 August, 2023; originally announced August 2023.

    Comments: Accepted by ICCV 2023

  35. arXiv:2306.07952

    cs.CV cs.CL cs.LG

    MOFI: Learning Image Representations from Noisy Entity Annotated Images

    Authors: Wentao Wu, Aleksei Timofeev, Chen Chen, Bowen Zhang, Kun Duan, Shuangning Liu, Yantao Zheng, Jonathon Shlens, Xianzhi Du, Zhe Gan, Yinfei Yang

    Abstract: We present MOFI, Manifold OF Images, a new vision foundation model designed to learn image representations from noisy entity annotated images. MOFI differs from previous work in two key aspects: (i) pre-training data, and (ii) training recipe. Regarding data, we introduce a new approach to automatically assign entity labels to images from noisy image-text pairs. Our approach involves employing a n…

    Submitted 17 March, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: Accepted to ICLR 2024

  36. arXiv:2306.04579

    eess.IV cs.CV

    A Dataset for Deep Learning-based Bone Structure Analyses in Total Hip Arthroplasty

    Authors: Kaidong Zhang, Ziyang Gan, Dong Liu, Xifu Shang

    Abstract: Total hip arthroplasty (THA) is a widely used surgical procedure in orthopedics. For THA, it is of clinical significance to analyze the bone structure from the CT images, especially to observe the structure of the acetabulum and femoral head, before the surgical procedure. For such bone structure analyses, deep learning technologies are promising but require high-quality labeled data for the learn…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: 16 pages, 17 figures

  37. arXiv:2305.10766

    cs.AI cs.CR cs.CV

    Adversarial Amendment is the Only Force Capable of Transforming an Enemy into a Friend

    Authors: Chong Yu, Tao Chen, Zhongxue Gan

    Abstract: Adversarial attack is commonly regarded as a huge threat to neural networks because of misleading behavior. This paper presents an opposite perspective: adversarial attacks can be harnessed to improve neural models if amended correctly. Unlike traditional adversarial defense or adversarial training schemes that aim to improve the adversarial robustness, the proposed adversarial amendment (AdvAmd)…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to IJCAI 2023, 10 pages, 5 figures

  38. arXiv:2305.10727

    cs.CV cs.LG cs.PF

    Boost Vision Transformer with GPU-Friendly Sparsity and Quantization

    Authors: Chong Yu, Tao Chen, Zhongxue Gan, Jiayuan Fan

    Abstract: The transformer extends its success from the language to the vision domain. Because of the stacked self-attention and cross-attention blocks, the acceleration deployment of vision transformer on GPU hardware is challenging and also rarely studied. This paper thoroughly designs a compression scheme to maximally utilize the GPU-friendly 2:4 fine-grained structured sparsity and quantization. Speciall…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to CVPR 2023, 11 pages, 6 figures

  39. arXiv:2305.07223

    cs.SD cs.CV cs.MM eess.AS

    Transavs: End-To-End Audio-Visual Segmentation With Transformer

    Authors: Yuhang Ling, Yuxi Li, Zhenye Gan, Jiangning Zhang, Mingmin Chi, Yabiao Wang

    Abstract: Audio-Visual Segmentation (AVS) is a challenging task, which aims to segment sounding objects in video frames by exploring audio signals. Generally AVS faces two key challenges: (1) Audio signals inherently exhibit a high degree of information density, as sounds produced by multiple objects are entangled within the same audio stream; (2) Objects of the same category tend to produce similar audio s…

    Submitted 26 December, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: 4 pages, 3 figures

  40. arXiv:2305.01622

    cs.RO cs.AI

    FlowMap: Path Generation for Automated Vehicles in Open Space Using Traffic Flow

    Authors: Wenchao Ding, Jieru Zhao, Yubin Chu, Haihui Huang, Tong Qin, Chunjing Xu, Yuxiang Guan, Zhongxue Gan

    Abstract: There is extensive literature on perceiving road structures by fusing various sensor inputs such as lidar point clouds and camera images using deep neural nets. Leveraging the latest advance of neural architectures (such as transformers) and bird-eye-view (BEV) representation, the road cognition accuracy keeps improving. However, how to cognize the "road" for automated vehicles where there is no we…

    Submitted 11 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: Accepted to ICRA 2023

  41. arXiv:2304.14933

    cs.CV cs.AI cs.CL cs.LG

    An Empirical Study of Multimodal Model Merging

    Authors: Yi-Lin Sung, Linjie Li, Kevin Lin, Zhe Gan, Mohit Bansal, Lijuan Wang

    Abstract: Model merging (e.g., via interpolation or task arithmetic) fuses multiple models trained on different tasks to generate a multi-task solution. The technique has been proven successful in previous studies, where the models are trained on similar tasks and with the same initialization. In this paper, we expand on this concept to a multimodal setup by merging transformers trained on different modalit…

    Submitted 11 October, 2023; v1 submitted 28 April, 2023; originally announced April 2023.

    Comments: EMNLP 2023 Findings

  42. arXiv:2304.06671

    cs.CV cs.AI cs.CL cs.LG

    Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

    Authors: Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal

    Abstract: Spatial control is a core capability in controllable image generation. Advancements in layout-guided image generation have shown promising results on in-distribution (ID) datasets with similar spatial configurations. However, it is unclear how these models perform when facing out-of-distribution (OOD) samples with arbitrary, unseen layouts. In this paper, we propose LayoutBench, a diagnostic bench…

    Submitted 14 April, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: 22 pages; Project website: https://layoutbench.github.io

  43. arXiv:2303.09061  [pdf, other]

    cs.CV

    MixTeacher: Mining Promising Labels with Mixed Scale Teacher for Semi-Supervised Object Detection

    Authors: Liang Liu, Boshen Zhang, Jiangning Zhang, Wuhao Zhang, Zhenye Gan, Guanzhong Tian, Wenbing Zhu, Yabiao Wang, Chengjie Wang

    Abstract: Scale variation across object instances remains a key challenge in object detection task. Despite the remarkable progress made by modern detection models, this challenge is particularly evident in the semi-supervised case. While existing semi-supervised object detection methods rely on strict conditions to filter high-quality pseudo labels from network predictions, we observe that objects with ext…

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted by CVPR 2023. Implementation available: https://github.com/lliuz/MixTeacher

  44. arXiv:2303.07582  [pdf, other]

    cs.CV

    Calibrated Teacher for Sparsely Annotated Object Detection

    Authors: Haohan Wang, Liang Liu, Boshen Zhang, Jiangning Zhang, Wuhao Zhang, Zhenye Gan, Yabiao Wang, Chengjie Wang, Haoqian Wang

    Abstract: Fully supervised object detection requires training images in which all instances are annotated. This is actually impractical due to the high labor and time costs and the unavoidable missing annotations. As a result, the incomplete annotation in each image could provide misleading supervision and harm the training. Recent works on sparsely annotated object detection alleviate this problem by gener…

    Submitted 13 March, 2023; originally announced March 2023.

  45. Transformation-Invariant Network for Few-Shot Object Detection in Remote Sensing Images

    Authors: Nanqing Liu, Xun Xu, Turgay Celik, Zongxin Gan, Heng-Chao Li

    Abstract: Object detection in remote sensing images relies on a large amount of labeled data for training. However, the increasing number of new categories and class imbalance make exhaustive annotation impractical. Few-shot object detection (FSOD) addresses this issue by leveraging meta-learning on seen base classes and fine-tuning on novel classes with limited labeled samples. Nonetheless, the substantial…

    Submitted 16 November, 2023; v1 submitted 12 March, 2023; originally announced March 2023.

    Comments: Accepted by TGRS. Modified some errors from the previous version

  46. Iterative Few-shot Semantic Segmentation from Image Label Text

    Authors: Haohan Wang, Liang Liu, Wuhao Zhang, Jiangning Zhang, Zhenye Gan, Yabiao Wang, Chengjie Wang, Haoqian Wang

    Abstract: Few-shot semantic segmentation aims to learn to segment unseen class objects with the guidance of only a few support images. Most previous methods rely on the pixel-level label of support images. In this paper, we focus on a more challenging setting, in which only the image-level labels are available. We propose a general framework to firstly generate coarse masks with the help of the powerful vis…

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: IJCAI 2022

  47. arXiv:2303.04861  [pdf, other]

    cs.RO

    Energetic Analysis on the Optimal Bounding Gaits of Quadrupedal Robots

    Authors: Yasser G. Alqaham, Jing Cheng, Zhenyu Gan

    Abstract: It is often overlooked by roboticists when designing locomotion controllers for their legged machines, that energy consumption plays an important role in selecting the best gaits for locomotion at high speeds or over long distances. The purpose of this study is to examine four similar asymmetrical quadrupedal gaits that are frequently observed in legged animals in nature. To understand how a speci…

    Submitted 8 March, 2023; originally announced March 2023.

  48. Breaking Symmetries Leads to Diverse Quadrupedal Gaits

    Authors: Jiayu Ding, Zhenyu Gan

    Abstract: Symmetry manifests itself in legged locomotion in a variety of ways. No matter where a legged system begins to move periodically, the torso and limbs coordinate with each other's movements in a similar manner. Also, in many gaits observed in nature, the legs on both sides of the torso move in exactly the same way, sometimes they are just half a period out of phase. Furthermore, when some animals m…

    Submitted 8 April, 2024; v1 submitted 8 March, 2023; originally announced March 2023.

    Comments: Please refer to the published version to cite this paper

    Journal ref: IEEE Robotics and Automation Letters, Institute of Electrical and Electronics Engineers (IEEE), 2024

  49. arXiv:2301.10654  [pdf]

    cs.NE nlin.AO

    Self-Evolutionary Reservoir Computer Based on Kuramoto Model

    Authors: Zhihao Zuo, Zhongxue Gan, Yuchuan Fan, Vjaceslavs Bobrovs, Xiaodan Pang, Oskars Ozolins

    Abstract: The human brain's synapses have remarkable activity-dependent plasticity, where the connectivity patterns of neurons change dramatically, relying on neuronal activities. As a biologically inspired neural network, reservoir computing (RC) has unique advantages in processing spatiotemporal information. However, typical reservoir architectures only take static random networks into account or consider…

    Submitted 25 January, 2023; originally announced January 2023.

    Comments: 13 pages, 7 figures

  50. arXiv:2212.11270  [pdf, other]

    cs.CV cs.CL

    Generalized Decoding for Pixel, Image, and Language

    Authors: Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao

    Abstract: We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that…

    Submitted 21 December, 2022; originally announced December 2022.

    Comments: https://x-decoder-vl.github.io