Showing 1–50 of 132 results for author: Zang, Y

  1. arXiv:2407.11691  [pdf, other]

    cs.CV

    VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

    Authors: Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

    Abstract: We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 70 different large multi-modality models, including both proprietary…

    Submitted 16 July, 2024; originally announced July 2024.

  2. arXiv:2407.10328  [pdf, other]

    cs.SD cs.AI eess.AS

    The Interpretation Gap in Text-to-Music Generation Models

    Authors: Yongyi Zang, Yixiao Zhang

    Abstract: Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls. Following this fr…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: Under review

  3. arXiv:2407.03320  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    Authors: Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, et al. (2 additional authors not shown)

    Abstract: We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. Th…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Technical Report. https://github.com/InternLM/InternLM-XComposer

  4. arXiv:2407.02165  [pdf, other]

    cs.CV

    WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation

    Authors: Zihao Huang, Shoukang Hu, Guangcong Wang, Tianqi Liu, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu

    Abstract: Existing human datasets for avatar creation are typically limited to laboratory environments, wherein high-quality annotations (e.g., SMPL estimation from 3D scans or multi-view images) can be ideally provided. However, their annotating requirements are impractical for real-world images or videos, posing challenges toward real-world applications on current avatar creation methods. To this end, we…

    Submitted 14 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: Project page: https://wildavatar.github.io/

  5. arXiv:2407.01530  [pdf, other]

    eess.IV cs.CV

    xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

    Authors: Tianrun Chen, Chaotao Ding, Lanyun Zhu, Tao Xu, Deyi Ji, Yan Wang, Ying Zang, Zejian Li

    Abstract: Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) have been pivotal in biomedical image segmentation, yet their ability to manage long-range dependencies remains constrained by inherent locality and computational overhead. To overcome these challenges, in this technical report, we first propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (…

    Submitted 2 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

  6. arXiv:2407.01523  [pdf, other]

    cs.CV cs.CL

    MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

    Authors: Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun

    Abstract: Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark co…

    Submitted 10 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

  7. arXiv:2406.18152  [pdf, other]

    cs.MA

    Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning

    Authors: Junkai Zhang, Yifan Zhang, Xi Sheryl Zhang, Yifan Zang, Jian Cheng

    Abstract: Efficient collaboration in the centralized training with decentralized execution (CTDE) paradigm remains a challenge in cooperative multi-agent systems. We identify divergent action tendencies among agents as a significant obstacle to CTDE's training efficiency, requiring a large number of training samples to achieve a unified consensus on agents' policies. This divergence stems from the lack of a…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: The AAAI-2024 paper with the appendix

  8. arXiv:2406.11833  [pdf, other]

    cs.CV cs.AI cs.LG

    MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

    Authors: Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history wit…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: This project is available at https://github.com/Liuziyu77/MMDU

  9. arXiv:2406.11739  [pdf, other]

    cs.CV

    V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results

    Authors: Jiaqi Wang, Yuhang Zang, Pan Zhang, Tao Chu, Yuhang Cao, Zeyi Sun, Ziyu Liu, Xiaoyi Dong, Tong Wu, Dahua Lin, Zeming Chen, Zhi Wang, Lingchen Meng, Wenhao Yao, Jianwei Yang, Sihong Wu, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Peixi Wu, Bosong Chai, Xuan Nie, Longquan Yan, Zeyu Wang, Qifan Zhou, et al. (9 additional authors not shown)

    Abstract: Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of public benchmarks and challenges to advance the field of object detection. Inspired by the success of previous COCO and LVIS Challenges, we organize the V3…

    Submitted 17 June, 2024; originally announced June 2024.

  10. arXiv:2406.11211  [pdf, other]

    cond-mat.mes-hall cond-mat.mtrl-sci cond-mat.supr-con

    Quantized Andreev conductance in semiconductor nanowires

    Authors: Yichun Gao, Wenyu Song, Yuhao Wang, Zuhan Geng, Zhan Cao, Zehao Yu, Shuai Yang, Jiaye Xu, Fangting Chen, Zonglin Li, Ruidong Li, Lining Yang, Zhaoyu Wang, Shan Zhang, Xiao Feng, Tiantian Wang, Yunyi Zang, Lin Li, Dong E. Liu, Runan Shang, Qi-Kun Xue, Ke He, Hao Zhang

    Abstract: Clean one-dimensional electron systems can exhibit quantized conductance. The plateau conductance doubles if the transport is dominated by Andreev reflection. Here, we report quantized conductance observed in both Andreev and normal-state transports in PbTe-Pb and PbTe-In hybrid nanowires. The Andreev plateau is observed at $4e^2/h$, twice the normal plateau value of $2e^2/h$. In comparison, An…

    Submitted 17 June, 2024; originally announced June 2024.

  11. arXiv:2406.05338  [pdf, other]

    cs.CV

    MotionClone: Training-Free Motion Cloning for Controllable Video Generation

    Authors: Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin

    Abstract: Motion-based controllable text-to-video generation involves motions to control the video generation. Previous methods typically require the training of models to encode motion cues or the fine-tuning of video diffusion models. However, these approaches often result in suboptimal motion generation when applied outside the trained domain. In this work, we propose MotionClone, a training-free framewo…

    Submitted 28 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: 17 pages, 12 figures, https://bujiazi.github.io/motionclone.github.io/

  12. arXiv:2406.04325  [pdf, other]

    cs.CV

    ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

    Authors: Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

    Abstract: We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating st…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://sharegpt4video.github.io/

  13. arXiv:2406.02438  [pdf, other]

    eess.AS cs.MM cs.SD

    CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

    Authors: Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

    Abstract: Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesi…

    Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  14. arXiv:2406.00093  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG cs.MM

    Bootstrap3D: Improving 3D Content Creation with Synthetic Data

    Authors: Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

    Abstract: Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D assets with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically…

    Submitted 31 May, 2024; originally announced June 2024.

    Comments: Project Page: https://sunzey.github.io/Bootstrap3D/

  15. arXiv:2405.19326  [pdf, other]

    cs.CV cs.GR cs.HC

    Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

    Authors: Tianrun Chen, Chunan Yu, Jing Li, Jianqi Zhang, Lanyun Zhu, Deyi Ji, Yong Zhang, Ying Zang, Zejian Li, Lingyun Sun

    Abstract: In this paper, we introduce a new task: Zero-Shot 3D Reasoning Segmentation for parts searching and localization for objects, which is a new paradigm to 3D segmentation that transcends limitations for previous category-specific 3D semantic segmentation, 3D instance segmentation, and open-vocabulary 3D segmentation. We design a simple baseline method, Reasoning3D, with the capability to understand…

    Submitted 29 May, 2024; originally announced May 2024.

  16. arXiv:2405.16009  [pdf, other]

    cs.CV

    Streaming Long Video Understanding with Large Language Models

    Authors: Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang

    Abstract: This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding, that capably understands arbitrary-length video with a constant number of video tokens streamingly encoded and adaptively selected. The challenge of video understanding in the vision language area mainly lies in the significant computational burden caused by the great number of tokens extrac…

    Submitted 24 May, 2024; originally announced May 2024.

  17. arXiv:2405.13428  [pdf, other]

    cs.SD eess.AS

    Ambisonizer: Neural Upmixing as Spherical Harmonics Generation

    Authors: Yongyi Zang, Yifan Wang, Minglun Lee

    Abstract: Neural upmixing, the task of generating immersive music with an increased number of channels from fewer input channels, has been an active research area, with mono-to-stereo and stereo-to-surround upmixing treated as separate problems. In this paper, we propose a unified approach to neural upmixing by formulating it as spherical harmonics - more specifically, Ambisonic generation. We explicitly fo…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Under review

  18. arXiv:2405.12218  [pdf, other]

    cs.CV

    MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

    Authors: Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu

    Abstract: We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct unseen scenes. Specifically, 1) we leverage MVS to encode geometry-aware Gaussian representations and decode them into Gaussian parameters. 2) To further enhance performance, we propose a hybrid Gaussian rendering that integrates an efficient volume…

    Submitted 15 July, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

    Comments: ECCV2024, Project page: https://mvsgaussian.github.io/, Code: https://github.com/TQTQliu/MVSGaussian

  19. arXiv:2405.05244  [pdf, other]

    eess.AS cs.AI cs.MM cs.SD

    SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

    Authors: You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

    Abstract: The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specializ…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Evaluation plan of the SVDD Challenge @ SLT 2024

  20. arXiv:2405.03275  [pdf, ps, other]

    math.CO

    Difference ascent sequences and related combinatorial structures

    Authors: Yongchun Zang, Robin D. P. Zhou

    Abstract: Ascent sequences were introduced by Bousquet-Mélou, Claesson, Dukes and Kitaev, which are in bijection with unlabeled $(2+2)$-free posets, Fishburn matrices, permutations avoiding a bivincular pattern of length $3$, and Stoimenow matchings. Analogous results for weak ascent sequences have been obtained by Bényi, Claesson and Dukes. Recently, Dukes and Sagan introduced a more general class of seque…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: 20 pages, 3 figures

    MSC Class: 05A05; 05C30

  21. arXiv:2405.03268  [pdf, ps, other]

    math.CO

    On the enumeration of permutations avoiding chains of patterns

    Authors: Robin D. P. Zhou, Yongchun Zang

    Abstract: In 2019, Bóna and Smith introduced the notion of strong pattern avoidance, saying that a permutation $π$ strongly avoids a pattern $σ$ if $π$ and $π^2$ both avoid $σ$. Recently, Archer and Geary generalized the idea of strong pattern avoidance to chain avoidance, in which a permutation $π$ avoids a chain of patterns $(τ^{(1)}:τ^{(2)}:\cdots:τ^{(k)})$ if the $i$-th power of the permutation avoids t…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: 8 pages

    MSC Class: 05A05; 05C30

  22. arXiv:2404.13044  [pdf, other]

    cs.CV

    Unified Scene Representation and Reconstruction for 3D Large Language Models

    Authors: Tao Chu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Qiong Liu, Jiaqi Wang

    Abstract: Enabling Large Language Models (LLMs) to interact with 3D environments is challenging. Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models. Text-image aligned 2D features from CLIP are then lifted to point clouds, which serve as inputs for LLMs. However, this solution lacks the establishment of 3D point-to-point connections…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: Project Page: https://chtsy.github.io/uni3drr-page/

  23. arXiv:2404.12652  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

    Authors: Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun

    Abstract: Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question as visual concepts learned "for free" would enable wide applications such as neuro-symbolic reasoning or human-interpretable object classification. We assume that the visual concepts, if captured by pre-t…

    Submitted 19 April, 2024; originally announced April 2024.

  24. arXiv:2404.06899  [pdf, other]

    cond-mat.mes-hall cond-mat.supr-con

    SQUID oscillations in PbTe nanowire networks

    Authors: Yichun Gao, Wenyu Song, Zehao Yu, Shuai Yang, Yuhao Wang, Ruidong Li, Fangting Chen, Zuhan Geng, Lining Yang, Jiaye Xu, Zhaoyu Wang, Zonglin Li, Shan Zhang, Xiao Feng, Tiantian Wang, Yunyi Zang, Lin Li, Runan Shang, Qi-Kun Xue, Ke He, Hao Zhang

    Abstract: Network structures by semiconductor nanowires hold great promise for advanced quantum devices, especially for applications in topological quantum computing. In this study, we created networks of PbTe nanowires arranged in loop configurations. Using shadow-wall epitaxy, we defined superconducting quantum interference devices (SQUIDs) using the superconductor Pb. These SQUIDs exhibit oscillations in…

    Submitted 10 April, 2024; originally announced April 2024.

    Journal ref: Phys. Rev. B 110, 045405 (2024)

  25. arXiv:2404.06512  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow reso…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Code and models are publicly available at https://github.com/InternLM/InternLM-XComposer

  26. arXiv:2404.02760  [pdf, other]

    cond-mat.mes-hall

    Gate-tunable subband degeneracy in semiconductor nanowires

    Authors: Yuhao Wang, Wenyu Song, Zhan Cao, Zehao Yu, Shuai Yang, Zonglin Li, Yichun Gao, Ruidong Li, Fangting Chen, Zuhan Geng, Lining Yang, Jiaye Xu, Zhaoyu Wang, Shan Zhang, Xiao Feng, Tiantian Wang, Yunyi Zang, Lin Li, Runan Shang, Qi-Kun Xue, Dong E. Liu, Ke He, Hao Zhang

    Abstract: Degeneracy and symmetry have a profound relation in quantum systems. Here, we report gate-tunable subband degeneracy in PbTe nanowires with a nearly symmetric cross-sectional shape. The degeneracy is revealed in electron transport by the absence of a quantized plateau. Utilizing a dual gate design, we can apply an electric field to lift the degeneracy, reflected as emergence of the plateau. This d…

    Submitted 3 April, 2024; originally announced April 2024.

    Journal ref: PNAS 121, e2406884121 (2024)

  27. arXiv:2403.20330  [pdf, other]

    cs.CV

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Authors: Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

    Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomeno…

    Submitted 9 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: Project page: https://mmstar-benchmark.github.io/

  28. arXiv:2403.17297  [pdf, other]

    cs.CL cs.AI

    InternLM2 Technical Report

    Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, et al. (75 additional authors not shown)

    Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m…

    Submitted 25 March, 2024; originally announced March 2024.

  29. arXiv:2403.15378  [pdf, other]

    cs.CV

    Long-CLIP: Unlocking the Long-Text Capability of CLIP

    Authors: Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

    Abstract: Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of text input. The length of the text token is restricted to 77, and an empirical study shows the actual effective…

    Submitted 23 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

    Comments: All codes and models are publicly available at https://github.com/beichenzbc/Long-CLIP

  30. arXiv:2403.13805  [pdf, other]

    cs.CV cs.AI cs.LG

    RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

    Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

    Abstract: CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Project: https://github.com/Liuziyu77/RAR

  31. arXiv:2402.05589  [pdf, other]

    cs.CV

    RESMatch: Referring Expression Segmentation in a Semi-Supervised Manner

    Authors: Ying Zang, Chenglong Fu, Runlong Cao, Didi Zhu, Min Zhang, Wenjun Hu, Lanyun Zhu, Tianrun Chen

    Abstract: Referring expression segmentation (RES), a task that involves localizing specific instance-level objects based on free-form linguistic descriptions, has emerged as a crucial frontier in human-AI interaction. It demands an intricate understanding of both visual and textual contexts and often requires extensive training data. This paper introduces RESMatch, the first semi-supervised learning (SSL) a…

    Submitted 11 February, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

  32. arXiv:2402.04024  [pdf, other]

    cond-mat.mes-hall cond-mat.mtrl-sci cond-mat.supr-con

    Epitaxial Indium on PbTe Nanowires for Quantum Devices

    Authors: Zuhan Geng, Fangting Chen, Yichun Gao, Lining Yang, Yuhao Wang, Shuai Yang, Shan Zhang, Zonglin Li, Wenyu Song, Jiaye Xu, Zehao Yu, Ruidong Li, Zhaoyu Wang, Xiao Feng, Tiantian Wang, Yunyi Zang, Lin Li, Runan Shang, Qi-Kun Xue, Ke He, Hao Zhang

    Abstract: Superconductivity in semiconductor nanostructures contains fascinating physics due to the interplay between Andreev reflection, spin, and orbital interactions. New material hybrids can access new quantum regimes and phenomena. Here, we report the realization of epitaxial indium thin films on PbTe nanowires. The film is continuous and forms an atomically sharp interface with PbTe. Tunneling devices r…

    Submitted 6 February, 2024; originally announced February 2024.

  33. arXiv:2402.02132  [pdf, other]

    cond-mat.mes-hall cond-mat.mtrl-sci

    Reducing disorder in PbTe nanowires for Majorana research

    Authors: Wenyu Song, Zehao Yu, Yuhao Wang, Yichun Gao, Zonglin Li, Shuai Yang, Shan Zhang, Zuhan Geng, Ruidong Li, Zhaoyu Wang, Fangting Chen, Lining Yang, Wentao Miao, Jiaye Xu, Xiao Feng, Tiantian Wang, Yunyi Zang, Lin Li, Runan Shang, Qi-Kun Xue, Ke He, Hao Zhang

    Abstract: Material challenges are the key issue in Majorana nanowires where surface disorder constrains device performance. Here, we tackle this challenge by embedding PbTe nanowires within a lattice-matched crystal, an oxide-free environment. The wire edges are shaped by self-organized growth instead of lithography, resulting in nearly-atomic-flat facets along both cross-sectional and longitudinal direction…

    Submitted 3 February, 2024; originally announced February 2024.

  34. arXiv:2401.16420  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XCo…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: Code and models are available at https://github.com/InternLM/InternLM-XComposer

  35. arXiv:2401.15914  [pdf, other]

    cs.CV cs.AI

    Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

    Authors: Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang

    Abstract: Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. There are recent finetuning methods, such as prompt learning, that not only study the discrimination between in-distribution (ID) and out-of-distri…

    Submitted 15 April, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: ICLR 2024

  36. arXiv:2401.11239  [pdf, other]

    cs.CV

    Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles

    Authors: Yanlong Zang, Han Yang, Jiaxu Miao, Yi Yang

    Abstract: Image-based virtual try-on systems, which fit new garments onto human portraits, are gaining research attention. An ideal pipeline should preserve the static features of clothes (like textures and logos) while also generating dynamic elements (e.g., shadows, folds) that adapt to the model's pose and environment. Previous works fail specifically in generating dynamic features, as they preserve the warped in-sh…

    Submitted 20 January, 2024; originally announced January 2024.

  37. arXiv:2312.14472  [pdf, other]

    cs.AI

    Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing

    Authors: Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng

    Abstract: Multi-task reinforcement learning endeavors to accomplish a set of different tasks with a single policy. To enhance data efficiency by sharing parameters across multiple tasks, a common practice segments the network into distinct modules and trains a routing network to recombine these modules into task-specific policies. However, existing routing approaches employ a fixed number of modules for all…

    Submitted 25 January, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: AAAI2024, with supplementary material

    Journal ref: 38th AAAI Conference on Artificial Intelligence (AAAI2024), Vancouver, BC, Canada, 2024

  38. arXiv:2312.11188  [pdf, other]

    cond-mat.str-el cond-mat.stat-mech quant-ph

    Detecting Quantum Anomalies in Open Systems

    Authors: Yunlong Zang, Yingfei Gu, Shenghan Jiang

    Abstract: Symmetries and quantum anomalies serve as powerful tools for constraining complicated quantum many-body systems, offering valuable insights into low-energy characteristics based on their ultraviolet structure. Nevertheless, their applicability has traditionally been confined to closed quantum systems, rendering them largely unexplored for open quantum systems described by density matrices. In this…

    Submitted 14 May, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: v2: 7+7 pages, 3+5 figures; discussion improved, details added, reference updated

  39. arXiv:2312.04435  [pdf, other]

    cs.MM

    Deep3DSketch: 3D modeling from Free-hand Sketches with View- and Structural-Aware Adversarial Training

    Authors: Tianrun Chen, Chenglong Fu, Lanyun Zhu, Papa Mao, Jia Zhang, Ying Zang, Lingyun Sun

    Abstract: This work aims to investigate the problem of 3D modeling using single free-hand sketches, which is one of the most natural ways we humans express ideas. Although sketch-based 3D modeling can drastically make the 3D modeling process more accessible, the sparsity and ambiguity of sketches bring significant challenges for creating high-fidelity 3D models that reflect the creators' ideas. In this work… ▽ More

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: ICASSP 2023. arXiv admin note: substantial text overlap with arXiv:2310.18148

  40. arXiv:2312.03818  [pdf, other

    cs.CV cs.AI cs.CL cs.LG

    Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

    Authors: Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

    Abstract: Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific reg… ▽ More

    Submitted 13 December, 2023; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: project page: https://aleafy.github.io/alpha-clip code: https://github.com/SunzeY/AlphaCLIP

  41. arXiv:2311.18433  [pdf, other

    cs.CV

    E2PNet: Event to Point Cloud Registration with Spatio-Temporal Representation Learning

    Authors: Xiuhong Lin, Changjie Qiu, Zhipeng Cai, Siqi Shen, Yu Zang, Weiquan Liu, Xuesheng Bian, Matthias Müller, Cheng Wang

    Abstract: Event cameras have emerged as a promising vision sensor in recent years due to their unparalleled temporal resolution and dynamic range. While registration of 2D RGB images to 3D point clouds is a long-standing problem in computer vision, no prior work studies 2D-3D registration for event cameras. To this end, we propose E2PNet, the first learning-based method for event-to-point cloud registration… ▽ More

    Submitted 27 December, 2023; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: 10 pages, 4 figures, accepted by Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

  42. arXiv:2311.16815  [pdf, other

    cond-mat.supr-con cond-mat.mes-hall cond-mat.mtrl-sci

    Selective-Area-Grown PbTe-Pb Planar Josephson Junctions for Quantum Devices

    Authors: Ruidong Li, Wenyu Song, Wentao Miao, Zehao Yu, Zhaoyu Wang, Shuai Yang, Yichun Gao, Yuhao Wang, Fangting Chen, Zuhan Geng, Lining Yang, Jiaye Xu, Xiao Feng, Tiantian Wang, Yunyi Zang, Lin Li, Runan Shang, Qi-Kun Xue, Ke He, Hao Zhang

    Abstract: Planar Josephson junctions are predicted to host Majorana zero modes. The material platforms in previous studies are two dimensional electron gases (InAs, InSb, InAsSb and HgTe) coupled to a superconductor such as Al or Nb. Here, we introduce a new material platform for planar JJs, the PbTe-Pb hybrid. The semiconductor, PbTe, was grown as a thin film via selective area epitaxy. The Josephson junct… ▽ More

    Submitted 2 April, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

    Journal ref: Nano Letters (2024)

  43. arXiv:2311.11330  [pdf, ps, other

    math.DG

    Generalized Ricci surfaces

    Authors: Benoît Daniel, Yiming Zang

    Abstract: We consider smooth Riemannian surfaces whose curvature $K$ satisfies the relation $\Delta\log|K-c|=aK+b$ away from points where $K=c$ for some $(a,b,c)\in\mathbb{R}^3$, which we call generalized Ricci surfaces. We prove some isometric immersion theorems allowing points where $K=c$ using properties of log-harmonic functions. For instance, we obtain a characterization of Riemannian surfaces that locally… ▽ More

    Submitted 19 November, 2023; originally announced November 2023.

    Comments: 31 pages

    MSC Class: Primary 53C25; 53C42; Secondary 30F45; 53A15

  44. arXiv:2310.18609  [pdf, other

    cs.MM

    Deep3DSketch+: Obtaining Customized 3D Model by Single Free-Hand Sketch through Deep Learning

    Authors: Ying Zang, Chenglong Fu, Tianrun Chen, Yuanqi Hu, Qingshan Liu, Wenjun Hu

    Abstract: As 3D models become critical in today's manufacturing and product design, conventional 3D modeling approaches based on Computer-Aided Design (CAD) are labor-intensive, time-consuming, and have high demands on the creators. This work aims to introduce an alternative approach to 3D modeling by utilizing free-hand sketches to obtain desired 3D models. We introduce Deep3DSketch+, which is a deep-learn… ▽ More

    Submitted 28 October, 2023; originally announced October 2023.

  45. arXiv:2310.18178  [pdf, other

    cs.HC

    Deep3DSketch++: High-Fidelity 3D Modeling from Single Free-hand Sketches

    Authors: Ying Zang, Chaotao Ding, Tianrun Chen, Papa Mao, Wenjun Hu

    Abstract: The rise of AR/VR has led to an increased demand for 3D content. However, the traditional method of creating 3D content using Computer-Aided Design (CAD) is a labor-intensive and skill-demanding process, making it difficult to use for novice users. Sketch-based 3D modeling provides a promising solution by leveraging the intuitive nature of human-computer interaction. However, generating high-quali… ▽ More

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: Accepted at IEEE SMC 2023

  46. arXiv:2310.18148  [pdf, other

    cs.HC

    Reality3DSketch: Rapid 3D Modeling of Objects from Single Freehand Sketches

    Authors: Tianrun Chen, Chaotao Ding, Lanyun Zhu, Ying Zang, Yiyi Liao, Zejian Li, Lingyun Sun

    Abstract: The emerging trend of AR/VR places great demands on 3D content. However, most existing software requires expertise and is difficult for novice users to use. In this paper, we aim to create sketch-based modeling tools for user-friendly 3D modeling. We introduce Reality3DSketch with a novel application of an immersive 3D modeling experience, in which a user can capture the surrounding scene using a… ▽ More

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: IEEE Transactions on MultiMedia

  47. arXiv:2309.13006  [pdf, other

    cs.CV

    Deep3DSketch+: Rapid 3D Modeling from Single Free-hand Sketches

    Authors: Tianrun Chen, Chenglong Fu, Ying Zang, Lanyun Zhu, Jia Zhang, Papa Mao, Lingyun Sun

    Abstract: The rapid development of AR/VR brings tremendous demands for 3D content. While the widely-used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of computer-human interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content re… ▽ More

    Submitted 22 September, 2023; originally announced September 2023.

  48. arXiv:2309.09085  [pdf, other

    cs.SD cs.IR cs.MM eess.AS eess.SP

    SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription

    Authors: Yongyi Zang, Yi Zhong, Frank Cwitkowitz, Zhiyao Duan

    Abstract: Guitar tablature is a form of music notation widely used among guitarists. It captures not only the musical content of a piece, but also its implementation and ornamentation on the instrument. Guitar Tablature Transcription (GTT) is an important task with broad applications in music education, composition, and entertainment. Existing GTT datasets are quite limited in size and scope, rendering mode… ▽ More

    Submitted 24 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  49. arXiv:2309.07525  [pdf, other

    cs.SD cs.AI eess.AS

    SingFake: Singing Voice Deepfake Detection

    Authors: Yongyi Zang, You Zhang, Mojtaba Heydari, Zhiyao Duan

    Abstract: The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances.… ▽ More

    Submitted 21 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  50. Ballistic PbTe Nanowire Devices

    Authors: Yuhao Wang, Fangting Chen, Wenyu Song, Zuhan Geng, Zehao Yu, Lining Yang, Yichun Gao, Ruidong Li, Shuai Yang, Wentao Miao, Wei Xu, Zhaoyu Wang, Zezhou Xia, Huading Song, Xiao Feng, Yunyi Zang, Lin Li, Runan Shang, Qi-Kun Xue, Ke He, Hao Zhang

    Abstract: Disorder is the primary obstacle in current Majorana nanowire experiments. Reducing disorder or achieving ballistic transport is thus of paramount importance. In clean and ballistic nanowire devices, quantized conductance is expected with plateau quality serving as a benchmark for disorder assessment. Here, we introduce ballistic PbTe nanowire devices grown using the selective-area-growth (SAG) te… ▽ More

    Submitted 12 September, 2023; originally announced September 2023.

    Journal ref: Nano Letters (2023)