Showing 1–50 of 69 results for author: Zang, Y

  1. arXiv:2407.03320  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    Authors: Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, et al. (2 additional authors not shown)

    Abstract: We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. Th…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Technical Report. https://github.com/InternLM/InternLM-XComposer

  2. arXiv:2407.02165  [pdf, other]

    cs.CV

    WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation

    Authors: Zihao Huang, ShouKang Hu, Guangcong Wang, Tianqi Liu, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu

    Abstract: Existing human datasets for avatar creation are typically limited to laboratory environments, wherein high-quality annotations (e.g., SMPL estimation from 3D scans or multi-view images) can be ideally provided. However, their annotating requirements are impractical for real-world images or videos, posing challenges toward real-world applications on current avatar creation methods. To this end, we…

    Submitted 2 July, 2024; originally announced July 2024.

  3. arXiv:2407.01530  [pdf, other]

    eess.IV cs.CV

    xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

    Authors: Tianrun Chen, Chaotao Ding, Lanyun Zhu, Tao Xu, Deyi Ji, Yan Wang, Ying Zang, Zejian Li

    Abstract: Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) have been pivotal in biomedical image segmentation, yet their ability to manage long-range dependencies remains constrained by inherent locality and computational overhead. To overcome these challenges, in this technical report, we first propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (…

    Submitted 2 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

  4. arXiv:2407.01523  [pdf, other]

    cs.CV cs.CL

    MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

    Authors: Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun

    Abstract: Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark co…

    Submitted 1 July, 2024; originally announced July 2024.

  5. arXiv:2406.18152  [pdf, other]

    cs.MA

    Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning

    Authors: Junkai Zhang, Yifan Zhang, Xi Sheryl Zhang, Yifan Zang, Jian Cheng

    Abstract: Efficient collaboration in the centralized training with decentralized execution (CTDE) paradigm remains a challenge in cooperative multi-agent systems. We identify divergent action tendencies among agents as a significant obstacle to CTDE's training efficiency, requiring a large number of training samples to achieve a unified consensus on agents' policies. This divergence stems from the lack of a…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: The AAAI-2024 paper with the appendix

  6. arXiv:2406.11833  [pdf, other]

    cs.CV cs.AI cs.LG

    MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

    Authors: Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history wit…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: This project is available at https://github.com/Liuziyu77/MMDU

  7. arXiv:2406.11739  [pdf, other]

    cs.CV

    V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results

    Authors: Jiaqi Wang, Yuhang Zang, Pan Zhang, Tao Chu, Yuhang Cao, Zeyi Sun, Ziyu Liu, Xiaoyi Dong, Tong Wu, Dahua Lin, Zeming Chen, Zhi Wang, Lingchen Meng, Wenhao Yao, Jianwei Yang, Sihong Wu, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Peixi Wu, Bosong Chai, Xuan Nie, Longquan Yan, Zeyu Wang, Qifan Zhou, et al. (9 additional authors not shown)

    Abstract: Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of public benchmarks and challenges to advance the field of object detection. Inspired by the success of previous COCO and LVIS Challenges, we organize the V3…

    Submitted 17 June, 2024; originally announced June 2024.

  8. arXiv:2406.05338  [pdf, other]

    cs.CV

    MotionClone: Training-Free Motion Cloning for Controllable Video Generation

    Authors: Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin

    Abstract: Motion-based controllable text-to-video generation involves motions to control the video generation. Previous methods typically require the training of models to encode motion cues or the fine-tuning of video diffusion models. However, these approaches often result in suboptimal motion generation when applied outside the trained domain. In this work, we propose MotionClone, a training-free framewo…

    Submitted 28 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: 17 pages, 12 figures, https://bujiazi.github.io/motionclone.github.io/

  9. arXiv:2406.04325  [pdf, other]

    cs.CV

    ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

    Authors: Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

    Abstract: We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating st…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://sharegpt4video.github.io/

  10. arXiv:2406.02438  [pdf, other]

    eess.AS cs.MM cs.SD

    CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

    Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

    Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi…

    Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  11. arXiv:2406.00093  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG cs.MM

    Bootstrap3D: Improving 3D Content Creation with Synthetic Data

    Authors: Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

    Abstract: Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D assets with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically…

    Submitted 31 May, 2024; originally announced June 2024.

    Comments: Project Page: https://sunzey.github.io/Bootstrap3D/

  12. arXiv:2405.19326  [pdf, other]

    cs.CV cs.GR cs.HC

    Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

    Authors: Tianrun Chen, Chunan Yu, Jing Li, Jianqi Zhang, Lanyun Zhu, Deyi Ji, Yong Zhang, Ying Zang, Zejian Li, Lingyun Sun

    Abstract: In this paper, we introduce a new task: Zero-Shot 3D Reasoning Segmentation for parts searching and localization for objects, which is a new paradigm to 3D segmentation that transcends limitations for previous category-specific 3D semantic segmentation, 3D instance segmentation, and open-vocabulary 3D segmentation. We design a simple baseline method, Reasoning3D, with the capability to understand…

    Submitted 29 May, 2024; originally announced May 2024.

  13. arXiv:2405.16009  [pdf, other]

    cs.CV

    Streaming Long Video Understanding with Large Language Models

    Authors: Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang

    Abstract: This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding, that capably understands arbitrary-length video with a constant number of video tokens streamingly encoded and adaptively selected. The challenge of video understanding in the vision language area mainly lies in the significant computational burden caused by the great number of tokens extrac…

    Submitted 24 May, 2024; originally announced May 2024.

  14. arXiv:2405.13428  [pdf, other]

    cs.SD eess.AS

    Ambisonizer: Neural Upmixing as Spherical Harmonics Generation

    Authors: Yongyi Zang, Yifan Wang, Minglun Lee

    Abstract: Neural upmixing, the task of generating immersive music with an increased number of channels from fewer input channels, has been an active research area, with mono-to-stereo and stereo-to-surround upmixing treated as separate problems. In this paper, we propose a unified approach to neural upmixing by formulating it as spherical harmonics - more specifically, Ambisonic generation. We explicitly fo…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Under review

  15. arXiv:2405.12218  [pdf, other]

    cs.CV

    Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

    Authors: Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu

    Abstract: We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct unseen scenes. Specifically, 1) we leverage MVS to encode geometry-aware Gaussian representations and decode them into Gaussian parameters. 2) To further enhance performance, we propose a hybrid Gaussian rendering that integrates an efficient volume…

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: Project page: https://mvsgaussian.github.io/

  16. arXiv:2405.05244  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

    Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

  17. arXiv:2404.13044  [pdf, other]

    cs.CV

    Unified Scene Representation and Reconstruction for 3D Large Language Models

    Authors: Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qiong Liu, Jiaqi Wang

    Abstract: Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted to point clouds, which serve as inputs for LLMs. However, this solution lacks the establishment of 3D point-to-point connections…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: Project Page: https://chtsy.github.io/uni3drr-page/

  18. arXiv:2404.12652  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

    Authors: Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun

    Abstract: Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question as visual concepts learned "for free" would enable wide applications such as neuro-symbolic reasoning or human-interpretable object classification. We assume that the visual concepts, if captured by pre-t…

    Submitted 19 April, 2024; originally announced April 2024.

  19. arXiv:2404.06512  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow reso…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Code and models are publicly available at https://github.com/InternLM/InternLM-XComposer

  20. arXiv:2403.20330  [pdf, other]

    cs.CV

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

    Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomeno…

    Submitted 9 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: Project page: https://mmstar-benchmark.github.io/

  21. arXiv:2403.17297  [pdf, other]

    cs.CL cs.AI

    InternLM2 Technical Report

    Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, et al. (75 additional authors not shown)

    Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m…

    Submitted 25 March, 2024; originally announced March 2024.

  22. arXiv:2403.15378  [pdf, other]

    cs.CV

    Long-CLIP: Unlocking the Long-Text Capability of CLIP

    Authors: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

    Abstract: Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The length of the text token is restricted to 77, and an empirical study shows the actual effective…

    Submitted 23 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

    Comments: All codes and models are publicly available at https://github.com/beichenzbc/Long-CLIP

  23. arXiv:2403.13805  [pdf, other]

    cs.CV cs.AI cs.LG

    RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

    Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

    Abstract: CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noise image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Project: https://github.com/Liuziyu77/RAR

  24. arXiv:2402.05589  [pdf, other]

    cs.CV

    RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner

    Authors: Ying Zang, Chenglong Fu, Runlong Cao, Didi Zhu, Min Zhang, Wenjun Hu, Lanyun Zhu, Tianrun Chen

    Abstract: Referring expression segmentation (RES), a task that involves localizing specific instance-level objects based on free-form linguistic descriptions, has emerged as a crucial frontier in human-AI interaction. It demands an intricate understanding of both visual and textual contexts and often requires extensive training data. This paper introduces RESMatch, the first semi-supervised learning (SSL) a…

    Submitted 11 February, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

  25. arXiv:2401.16420  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XCo…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Code and models are available at https://github.com/InternLM/InternLM-XComposer

  26. arXiv:2401.15914  [pdf, other]

    cs.CV cs.AI

    Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

    Authors: Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang

    Abstract: Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. There are recent finetuning methods, such as prompt learning, that not only study the discrimination between in-distribution (ID) and out-of-distri…

    Submitted 15 April, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: ICLR 2024

  27. arXiv:2401.11239  [pdf, other]

    cs.CV

    Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles

    Authors: Yanlong Zang, Han Yang, Jiaxu Miao, Yi Yang

    Abstract: Image-based virtual try-on systems, which fit new garments onto human portraits, are gaining research attention. An ideal pipeline should preserve the static features of clothes (like textures and logos) while also generating dynamic elements (e.g. shadows, folds) that adapt to the model's pose and environment. Previous works fail specifically in generating dynamic features, as they preserve the warped in-sh…

    Submitted 20 January, 2024; originally announced January 2024.

  28. arXiv:2312.14472  [pdf, other]

    cs.AI

    Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing

    Authors: Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng

    Abstract: Multi-task reinforcement learning endeavors to accomplish a set of different tasks with a single policy. To enhance data efficiency by sharing parameters across multiple tasks, a common practice segments the network into distinct modules and trains a routing network to recombine these modules into task-specific policies. However, existing routing approaches employ a fixed number of modules for all…

    Submitted 25 January, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: AAAI2024, with supplementary material

    Journal ref: 38th AAAI Conference on Artificial Intelligence (AAAI2024), Vancouver, BC, Canada, 2024

  29. arXiv:2312.04435  [pdf, other]

    cs.MM

    Deep3DSketch: 3D modeling from Free-hand Sketches with View- and Structural-Aware Adversarial Training

    Authors: Tianrun Chen, Chenglong Fu, Lanyun Zhu, Papa Mao, Jia Zhang, Ying Zang, Lingyun Sun

    Abstract: This work aims to investigate the problem of 3D modeling using single free-hand sketches, which is one of the most natural ways we humans express ideas. Although sketch-based 3D modeling can drastically make the 3D modeling process more accessible, the sparsity and ambiguity of sketches bring significant challenges for creating high-fidelity 3D models that reflect the creators' ideas. In this work…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: ICASSP 2023. arXiv admin note: substantial text overlap with arXiv:2310.18148

  30. arXiv:2312.03818  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

    Authors: Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

    Abstract: Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific reg…

    Submitted 13 December, 2023; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: project page: https://aleafy.github.io/alpha-clip code: https://github.com/SunzeY/AlphaCLIP

  31. arXiv:2311.18433  [pdf, other]

    cs.CV

    E2PNet: Event to Point Cloud Registration with Spatio-Temporal Representation Learning

    Authors: Xiuhong Lin, Changjie Qiu, Zhipeng Cai, Siqi Shen, Yu Zang, Weiquan Liu, Xuesheng Bian, Matthias Müller, Cheng Wang

    Abstract: Event cameras have emerged as a promising vision sensor in recent years due to their unparalleled temporal resolution and dynamic range. While registration of 2D RGB images to 3D point clouds is a long-standing problem in computer vision, no prior work studies 2D-3D registration for event cameras. To this end, we propose E2PNet, the first learning-based method for event-to-point cloud registration…

    Submitted 27 December, 2023; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: 10 pages, 4 figures, accepted by Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

  32. arXiv:2310.18609  [pdf, other]

    cs.MM

    Deep3DSketch+: Obtaining Customized 3D Model by Single Free-Hand Sketch through Deep Learning

    Authors: Ying Zang, Chenglong Fu, Tianrun Chen, Yuanqi Hu, Qingshan Liu, Wenjun Hu

    Abstract: As 3D models become critical in today's manufacturing and product design, conventional 3D modeling approaches based on Computer-Aided Design (CAD) are labor-intensive, time-consuming, and have high demands on the creators. This work aims to introduce an alternative approach to 3D modeling by utilizing free-hand sketches to obtain desired 3D models. We introduce Deep3DSketch+, which is a deep-learn…

    Submitted 28 October, 2023; originally announced October 2023.

  33. arXiv:2310.18178  [pdf, other]

    cs.HC

    Deep3DSketch++: High-Fidelity 3D Modeling from Single Free-hand Sketches

    Authors: Ying Zang, Chaotao Ding, Tianrun Chen, Papa Mao, Wenjun Hu

    Abstract: The rise of AR/VR has led to an increased demand for 3D content. However, the traditional method of creating 3D content using Computer-Aided Design (CAD) is a labor-intensive and skill-demanding process, making it difficult to use for novice users. Sketch-based 3D modeling provides a promising solution by leveraging the intuitive nature of human-computer interaction. However, generating high-quali…

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: Accepted at IEEE SMC 2023

  34. arXiv:2310.18148  [pdf, other]

    cs.HC

    Reality3DSketch: Rapid 3D Modeling of Objects from Single Freehand Sketches

    Authors: Tianrun Chen, Chaotao Ding, Lanyun Zhu, Ying Zang, Yiyi Liao, Zejian Li, Lingyun Sun

    Abstract: The emerging trend of AR/VR places great demands on 3D content. However, most existing software requires expertise and is difficult for novice users to use. In this paper, we aim to create sketch-based modeling tools for user-friendly 3D modeling. We introduce Reality3DSketch with a novel application of an immersive 3D modeling experience, in which a user can capture the surrounding scene using a…

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: IEEE Transactions on MultiMedia

  35. arXiv:2309.13006  [pdf, other]

    cs.CV

    Deep3DSketch+: Rapid 3D Modeling from Single Free-hand Sketches

    Authors: Tianrun Chen, Chenglong Fu, Ying Zang, Lanyun Zhu, Jia Zhang, Papa Mao, Lingyun Sun

    Abstract: The rapid development of AR/VR brings tremendous demands for 3D content. While the widely-used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of computer-human interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content re…

    Submitted 22 September, 2023; originally announced September 2023.

  36. arXiv:2309.09085  [pdf, other]

    cs.SD cs.IR cs.MM eess.AS eess.SP

    SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription

    Authors: Yongyi Zang, Yi Zhong, Frank Cwitkowitz, Zhiyao Duan

    Abstract: Guitar tablature is a form of music notation widely used among guitarists. It captures not only the musical content of a piece, but also its implementation and ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an important task with broad applications in music education, composition, and entertainment. Existing GTT datasets are quite limited in size and scope, rendering mode…

    Submitted 24 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  37. arXiv:2309.07525  [pdf, other]

    cs.SD cs.AI eess.AS

    SingFake: Singing Voice Deepfake Detection

    Authors: Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan

    Abstract: The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances.…

    Submitted 21 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  38. arXiv:2308.16532  [pdf, other]

    cs.CV

    Decoupled Local Aggregation for Point Cloud Learning

    Authors: Binjie Chen, Yunzhou Xia, Yu Zang, Cheng Wang, Jonathan Li

    Abstract: The unstructured nature of point clouds demands that local aggregation be adaptive to different local structures. Previous methods meet this by explicitly embedding spatial relations into each aggregation process. Although this coupled approach has been shown effective in generating clear semantics, aggregation can be greatly slowed down due to repeated relation learning and redundant computation…

    Submitted 31 August, 2023; originally announced August 2023.

  39. arXiv:2307.15967  [pdf, other]

    cs.LG cs.AI

    Graph Condensation for Inductive Node Representation Learning

    Authors: Xinyi Gao, Tong Chen, Yilong Zang, Wentao Zhang, Quoc Viet Hung Nguyen, Kai Zheng, Hongzhi Yin

    Abstract: Graph neural networks (GNNs) encounter significant computational challenges when handling large-scale graphs, which severely restricts their efficacy across diverse applications. To address this limitation, graph condensation has emerged as a promising technique, which constructs a small synthetic graph for efficiently training GNNs while retaining performance. However, due to the topology structu…

    Submitted 9 December, 2023; v1 submitted 29 July, 2023; originally announced July 2023.

    Comments: 2024 IEEE 40th International Conference on Data Engineering (ICDE)

  40. Phase perturbation improves channel robustness for speech spoofing countermeasures

    Authors: Yongyi Zang, You Zhang, Zhiyao Duan

    Abstract: In this paper, we aim to address the problem of channel robustness in speech countermeasure (CM) systems, which are used to distinguish synthetic speech from human natural speech. On the basis of two hypotheses, we suggest an approach for perturbing phase information during the training of time-domain CM systems. Communication networks often employ lossy compression codecs that encode only magnitu…

    Submitted 6 October, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

    Comments: 5 pages; Proceedings of Interspeech 2023

  41. arXiv:2305.18279  [pdf, other]

    cs.CV cs.AI

    Contextual Object Detection with Multimodal Large Language Models

    Authors: Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, Chen Change Loy

    Abstract: Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object detection. In this work, we address this limitation by introducing a novel research problem of contextual object detection -- understanding visible objects within different human-AI interactive contexts. Th…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Github: https://github.com/yuhangzang/ContextDET, Project Page: https://www.mmlab-ntu.com/project/contextdet/index.html

  42. arXiv:2305.14813  [pdf, other]

    cs.CV

    Semi-Supervised and Long-Tailed Object Detection with CascadeMatch

    Authors: Yuhang Zang, Kaiyang Zhou, Chen Huang, Chen Change Loy

    Abstract: This paper focuses on long-tailed object detection in the semi-supervised learning setting, which poses realistic challenges, but has rarely been studied in the literature. We propose a novel pseudo-labeling-based detector called CascadeMatch. Our detector features a cascade network architecture, which has multi-stage detection heads with progressive confidence thresholds. To avoid manually tuning…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: International Journal of Computer Vision (IJCV), 2023

  43. arXiv:2305.12433  [pdf, other]

    cs.LG math.NA

    ParticleWNN: a Novel Neural Networks Framework for Solving Partial Differential Equations

    Authors: Yaohua Zang, Gang Bao

    Abstract: Deep neural networks (DNNs) have been widely used to solve partial differential equations (PDEs) in recent years. In this work, a novel deep learning-based framework named Particle Weak-form based Neural Networks (ParticleWNN) is developed for solving PDEs in the weak form. In this framework, the trial space is defined as the space of DNNs, while the test space consists of functions compactly supp…

    Submitted 12 November, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

  44. arXiv:2304.09148  [pdf, other]

    cs.CV

    SAM Fails to Segment Anything? -- SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More

    Authors: Tianrun Chen, Lanyun Zhu, Chaotao Ding, Runlong Cao, Yan Wang, Zejian Li, Lingyun Sun, Papa Mao, Ying Zang

    Abstract: The emergence of large models, also known as foundation models, has brought significant advancements to AI research. One such model is Segment Anything (SAM), which is designed for image segmentation tasks. However, as with other foundation models, our experimental findings suggest that SAM may fail or perform poorly in certain segmentation tasks, such as shadow detection and camouflaged object de…

    Submitted 2 May, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

  45. arXiv:2304.04200  [pdf, other]

    cs.CV cs.RO

    DSMNet: Deep High-precision 3D Surface Modeling from Sparse Point Cloud Frames

    Authors: Changjie Qiu, Zhiyong Wang, Xiuhong Lin, Yu Zang, Cheng Wang, Weiquan Liu

    Abstract: Existing point cloud modeling datasets primarily express the modeling precision by pose or trajectory precision rather than the point cloud modeling effect itself. Under this demand, we first independently construct a LiDAR system with an optical stage, and then we build a HPMB dataset based on the constructed LiDAR system, a High-Precision, Multi-Beam, real-world dataset. Second, we propos…

    Submitted 9 April, 2023; originally announced April 2023.

    Comments: To be published in IEEE Geoscience and Remote Sensing Letters (GRSL)

  46. arXiv:2302.10270  [pdf]

    cs.CV cs.LG

    Crop mapping in the small sample/no sample case: an approach using a two-level cascade classifier and integrating domain knowledge

    Authors: Yunze Zang, Yifei Liu, Xuehong Chen, Anqi Li, Yichen Zhai, Shijie Li, Luling Liu, Chuanhai Zhu, Ruilin Chen, Shupeng Li, Na Jie

    Abstract: Mapping crops using remote sensing technology is important for food security and land management. Machine learning-based methods have become a popular approach for crop mapping in recent years. However, the key to machine learning, acquiring ample and accurate samples, is usually time-consuming and laborious. To solve this problem, a crop mapping method in the small sample/no sample case that integ…

    Submitted 26 December, 2022; originally announced February 2023.

    Comments: in Chinese language

  47. arXiv:2210.07225  [pdf, other]

    cs.CV cs.AI

    Unified Vision and Language Prompt Learning

    Authors: Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy

    Abstract: Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is tha…

    Submitted 13 October, 2022; originally announced October 2022.

  48. arXiv:2209.07521  [pdf, other]

    cs.CV cs.AI cs.LG

    On-Device Domain Generalization

    Authors: Kaiyang Zhou, Yuanhan Zhang, Yuhang Zang, Jingkang Yang, Chen Change Loy, Ziwei Liu

    Abstract: We present a systematic study of domain generalization (DG) for tiny neural networks. This problem is critical to on-device machine learning applications but has been overlooked in the literature where research has been merely focused on large models. Tiny neural networks have much fewer parameters and lower complexity and therefore should not be trained the same way as their large counterparts fo…

    Submitted 7 November, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

    Comments: Preprint

  49. Open-Vocabulary DETR with Conditional Matching

    Authors: Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy

    Abstract: Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in form of either natural language or exemplar image. This offers great flexibility and use…

    Submitted 29 November, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

    Comments: ECCV 2022 Oral

  50. arXiv:2110.06088  [pdf, other]

    cs.SI cs.AI cs.LG

    ConTIG: Continuous Representation Learning on Temporal Interaction Graphs

    Authors: Xu Yan, Xiaoliang Fan, Peizhen Yang, Zonghan Wu, Shirui Pan, Longbiao Chen, Yu Zang, Cheng Wang

    Abstract: Representation learning on temporal interaction graphs (TIG) is to model complex networks with the dynamic evolution of interactions arising in a broad spectrum of problems. Existing dynamic embedding methods on TIG discretely update node embeddings merely when an interaction occurs. They fail to capture the continuous dynamic evolution of embedding trajectories of nodes. In this paper, we propose…

    Submitted 27 September, 2021; originally announced October 2021.

    Comments: 12 pages; 6 figures