Skip to main content

Showing 1–50 of 322 results for author: Ji, R

  1. arXiv:2407.10738  [pdf, other

    cs.CV

    AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

    Authors: Zhihang Lin, Mingbao Lin, Meng Zhao, Rongrong Ji

    Abstract: This paper attempts to address the object repetition issue in patch-wise higher-resolution image generation. We propose AccDiffusion, an accurate method for patch-wise higher-resolution image generation without training. An in-depth analysis in this paper reveals an identical text prompt for different patches causes repeated object generation, while no prompt compromises the image details. Therefo… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 23 pages

  2. arXiv:2407.06794  [pdf, other

    cs.CV

    ERQ: Error Reduction for Post-Training Quantization of Vision Transformers

    Authors: Yunshan Zhong, Jiawei Hu, You Huang, Yuxin Zhang, Rongrong Ji

    Abstract: Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the intricate interdependence between quantized weight and activation, leading to considerable quantization error. In this paper, we propose ERQ, a two-step PTQ approach meticulously crafted to sequentially redu… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: ICML2024 (Spotlight)

  3. arXiv:2407.05363  [pdf, other

    cs.CV

    Multi-branch Collaborative Learning Network for 3D Visual Grounding

    Authors: Zhipeng Qian, Yiwei Ma, Zhekai Lin, Jiayi Ji, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji

    Abstract: 3D referring expression comprehension (3DREC) and segmentation (3DRES) have overlapping objectives, indicating their potential for collaboration. However, existing collaborative approaches predominantly depend on the results of one task to make predictions for the other, limiting effective collaboration. We argue that employing separate branches for 3DREC and 3DRES tasks enhances the model's capac… ▽ More

    Submitted 10 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: ECCV2024

  4. arXiv:2407.05352  [pdf, other

    cs.CV cs.MM

    Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

    Authors: Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, Rongrong Ji

    Abstract: Recently, diffusion models have increasingly demonstrated their capabilities in vision understanding. By leveraging prompt-based learning to construct sentences, these models have shown proficiency in classification and visual grounding tasks. However, existing approaches primarily showcase their ability to perform sentence-level localization, leaving the potential for leveraging contextual inform… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  5. arXiv:2407.04241  [pdf, other

    cs.CV cs.AI

    AnySR: Realizing Image Super-Resolution as Any-Scale, Any-Resource

    Authors: Wengyi Zhan, Mingbao Lin, Chia-Wen Lin, Rongrong Ji

    Abstract: In an effort to improve the efficiency and scalability of single-image super-resolution (SISR) applications, we introduce AnySR, to rebuild existing arbitrary-scale SR methods into any-scale, any-resource implementation. As a contrast to off-the-shelf methods that solve SR tasks across various scales with the same computing costs, our AnySR innovates in: 1) building arbitrary-scale tasks as any-re… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  6. arXiv:2407.03900  [pdf, other

    cs.CV

    Oracle Bone Inscriptions Multi-modal Dataset

    Authors: Bang Li, Donghao Luo, Yujie Liang, Jing Yang, Zengmao Ding, Xu Peng, Boyuan Jiang, Shengwei Han, Dan Sui, Peichao Qin, Pian Wu, Chaoyang Wang, Yun Qi, Taisong Jin, Chengjie Wang, Xiaoming Huang, Zhan Shu, Rongrong Ji, Yongge Liu, Yunsheng Wu

    Abstract: Oracle bone inscriptions(OBI) is the earliest developed writing system in China, bearing invaluable written exemplifications of early Shang history and paleography. However, the task of deciphering OBI, in the current climate of the scholarship, can prove extremely challenging. Out of the 4,500 oracle bone characters excavated, only a third have been successfully identified. Therefore, leveraging… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  7. arXiv:2407.02483  [pdf, other

    cs.CL cs.AI

    MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

    Authors: Binxu Li, Tiankai Yan, Yuanting Pan, Zhe Xu, Jie Luo, Ruiyang Ji, Shilong Liu, Haoyu Dong, Zihao Lin, Yixin Wang

    Abstract: Multi-Modal Large Language Models (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To brid… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  8. arXiv:2407.02109  [pdf, other

    cs.CV cs.AI

    HRSAM: Efficiently Segment Anything in High-Resolution Images

    Authors: You Huang, Wenbin Lai, Jiayi Ji, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

    Abstract: The Segment Anything Model (SAM) has significantly advanced interactive segmentation but struggles with high-resolution images crucial for high-precision segmentation. This is primarily due to the quadratic space complexity of SAM-implemented attention and the length extrapolation issue in common global attention. This study proposes HRSAM that integrates Flash Attention and incorporates Plain, Sh… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  9. arXiv:2406.19394  [pdf, other

    cs.CV

    HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection

    Authors: Liujuan Cao, Jianghang Lin, Zebo Hong, Yunhang Shen, Shaohui Lin, Chao Chen, Rongrong Ji

    Abstract: Most WSOD methods rely on traditional object proposals to generate candidate regions and are confronted with unstable training, which easily gets stuck in a poor local optimum. In this paper, we introduce a unified, high-capacity weakly supervised object detection (WSOD) network called HUWSOD, which utilizes a comprehensive self-training framework without needing external modules or additional sup… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  10. arXiv:2406.19247  [pdf, other

    cs.CV

    Local Manifold Learning for No-Reference Image Quality Assessment

    Authors: Timin Gao, Wensheng Pan, Yan Zhang, Sicheng Zhao, Shengchuan Zhang, Xiawu Zheng, Ke Li, Liujuan Cao, Rongrong Ji

    Abstract: Contrastive learning has considerably advanced the field of Image Quality Assessment (IQA), emerging as a widely adopted technique. The core mechanism of contrastive learning involves minimizing the distance between quality-similar (positive) examples while maximizing the distance between quality-dissimilar (negative) examples. Despite its successes, current contrastive learning methods often negl… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  11. arXiv:2406.18173  [pdf, other

    cs.CL

    UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs

    Authors: Wenhao Li, Mingbao Lin, Yunshan Zhong, Shuicheng Yan, Rongrong Ji

    Abstract: Managing long texts is challenging for large language models (LLMs) due to limited context window sizes. This study introduces UIO-LLMs, an unbiased incremental optimization approach for memory-enhanced transformers under long-context settings. We initially conceptualize the process as a streamlined encoder-decoder framework where the weights-shared encoder and decoder respectively encapsulate a c… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  12. arXiv:2406.17601  [pdf, other

    cs.CV

    Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text

    Authors: Xinyang Li, Zhangyu Lai, Linning Xu, Yansong Qu, Liujuan Cao, Shengchuan Zhang, Bo Dai, Rongrong Ji

    Abstract: Recent advancements in 3D generation have leveraged synthetic datasets with ground truth 3D assets and predefined cameras. However, the potential of adopting real-world datasets, which can produce significantly more realistic 3D scenes, remains largely unexplored. In this work, we delve into the key challenge of the complex and scene-specific camera trajectories found in real-world captures. We in… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Code: https://github.com/imlixinyang/director3d

  13. arXiv:2406.17413  [pdf, other

    cs.CV

    Depth-Guided Semi-Supervised Instance Segmentation

    Authors: Xin Chen, Jie Hu, Xiawu Zheng, Jianghang Lin, Liujuan Cao, Rongrong Ji

    Abstract: Semi-Supervised Instance Segmentation (SSIS) aims to leverage an amount of unlabeled data during training. Previous frameworks primarily utilized the RGB information of unlabeled images to generate pseudo-labels. However, such a mechanism often introduces unstable noise, as a single instance can display multiple RGB values. To overcome this limitation, we introduce a Depth-Guided (DG) SSIS framewo… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 12 pages, 6 figures, 4 tables

  14. arXiv:2406.16449  [pdf, other

    cs.CV

    Evaluating and Analyzing Relationship Hallucinations in LVLMs

    Authors: Mingrui Wu, Jiayi Ji, Oucheng Huang, Jiale Li, Yuhang Wu, Xiaoshuai Sun, Rongrong Ji

    Abstract: The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which is essential for visual comprehension. In this work, we introduce R-Benc… ▽ More

    Submitted 11 July, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: ICML2024; Project Page:https://github.com/mrwu-mac/R-Bench

  15. arXiv:2406.15306  [pdf

    cs.LG cs.CL cs.CV

    Advanced Multimodal Deep Learning Architecture for Image-Text Matching

    Authors: Jinyin Wang, Haijing Zhang, Yihao Zhong, Yingbin Liang, Rongwei Ji, Yiru Cang

    Abstract: Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship. With the advent of the multimedia information age, image, and text data show explosive growth, and how to accurately realize the efficient and accurate semantic correspondence between them has become the core issue of common concern in academia and industry.… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2405.17460 by other authors

  16. arXiv:2406.11432  [pdf, other

    cs.CV cs.AI

    AnyTrans: Translate AnyText in the Image with Large Scale Models

    Authors: Zhipeng Qian, Pei Zhang, Baosong Yang, Kai Fan, Yiwei Ma, Derek F. Wong, Xiaoshuai Sun, Rongrong Ji

    Abstract: This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during tr… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  17. arXiv:2406.10228  [pdf, other

    cs.CV cs.AI cs.CL

    VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

    Authors: Chenyu Zhou, Mengdan Zhang, Peixian Chen, Chaoyou Fu, Yunhang Shen, Xiawu Zheng, Xing Sun, Rongrong Ji

    Abstract: The swift progress of Multi-modal Large Models (MLLMs) has showcased their impressive ability to tackle tasks blending vision and language. Yet, most current models and benchmarks cater to scenarios with a narrow scope of visual and textual contexts. These models often fall short when faced with complex comprehension tasks, which involve navigating through a plethora of irrelevant and potentially… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Project Page: https://zhourax.github.io/VEGA/

  18. arXiv:2406.05620  [pdf, other

    cs.CV

    Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval

    Authors: Yiwei Ma, Xiaoshuai Sun, Jiayi Ji, Guannan Jiang, Weilin Zhuang, Rongrong Ji

    Abstract: Text-based person retrieval (TPR) is a challenging task that involves retrieving a specific individual based on a textual description. Despite considerable efforts to bridge the gap between vision and language, the significant differences between these modalities continue to pose a challenge. Previous methods have attempted to align text and image samples in a modal-shared space, but they face unc… ▽ More

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: ACM MM2023

  19. arXiv:2406.01451  [pdf, other

    cs.CV cs.MM

    SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation

    Authors: Danni Yang, Jiayi Ji, Yiwei Ma, Tianyu Guo, Haowei Wang, Xiaoshuai Sun, Rongrong Ji

    Abstract: In this paper, we introduce SemiRES, a semi-supervised framework that effectively leverages a combination of labeled and unlabeled data to perform RES. A significant hurdle in applying semi-supervised techniques to RES is the prevalence of noisy pseudo-labels, particularly at the boundaries of objects. SemiRES incorporates the Segment Anything Model (SAM), renowned for its precise boundary demarca… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted by ICML2024

  20. arXiv:2406.00334  [pdf, other

    cs.CV

    Image Captioning via Dynamic Path Customization

    Authors: Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Yiyi Zhou, Xiaopeng Hong, Yongjian Wu, Rongrong Ji

    Abstract: This paper explores a novel dynamic network for vision and language tasks, where the inferring structure is customized on the fly for different inputs. Most previous state-of-the-art approaches are static and hand-crafted networks, which not only heavily rely on expert knowledge, but also ignore the semantic diversity of input samples, therefore resulting in suboptimal performance. To address thes… ▽ More

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: TNNLS24

  21. arXiv:2405.21075  [pdf, other

    cs.CV cs.CL

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Authors: Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun

    Abstract: In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality… ▽ More

    Submitted 16 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

    Comments: Project Page: https://video-mme.github.io

  22. arXiv:2405.20851  [pdf, other

    cs.CV

    MegActor: Harness the Power of Raw Video for Vivid Portrait Animation

    Authors: Shurong Yang, Huadong Li, Juhao Wu, Minhao Jing, Linze Li, Renhe Ji, Jiajun Liang, Haoqiang Fan

    Abstract: Despite raw driving videos contain richer information on facial expressions than intermediate representations such as landmarks in the field of portrait animation, they are seldom the subject of research. This is due to two challenges inherent in portrait animation driven with raw videos: 1) significant identity leakage; 2) Irrelevant background and facial details such as wrinkles degrade performa… ▽ More

    Submitted 18 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

  23. arXiv:2405.19914  [pdf, other

    cs.CV

    Towards RGB-NIR Cross-modality Image Registration and Beyond

    Authors: Huadong Li, Shichao Dong, Jin Wang, Rong Fu, Minhao Jing, Jiajun Liang, Haoqiang Fan, Renhe Ji

    Abstract: This paper focuses on the area of RGB(visible)-NIR(near-infrared) cross-modality image registration, which is crucial for many downstream vision tasks to fully leverage the complementary information present in visible and infrared images. In this field, researchers face two primary challenges - the absence of a correctly-annotated benchmark with viewpoint variations for evaluating RGB-NIR cross-mo… ▽ More

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: 18 pages, 7 figures

  24. arXiv:2405.18810  [pdf, other

    cs.CV cs.AI

    UniPTS: A Unified Framework for Proficient Post-Training Sparsity

    Authors: Jingjing Xie, Yuxin Zhang, Mingbao Lin, Zhihang Lin, Liujuan Cao, Rongrong Ji

    Abstract: Post-training Sparsity (PTS) is a recently emerged avenue that chases efficient network sparsity with limited data in need. Existing PTS methods, however, undergo significant performance degradation compared with traditional methods that retrain the sparse networks via the whole dataset, especially at high sparsity ratios. In this paper, we attempt to reconcile this disparity by transposing three… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR2024

  25. arXiv:2405.18706  [pdf, other

    cs.CV

    FocSAM: Delving Deeply into Focused Objects in Segmenting Anything

    Authors: You Huang, Zongyu Lan, Liujuan Cao, Xianming Lin, Shengchuan Zhang, Guannan Jiang, Rongrong Ji

    Abstract: The Segment Anything Model (SAM) marks a notable milestone in segmentation models, highlighted by its robust zero-shot capabilities and ability to handle diverse prompts. SAM follows a pipeline that separates interactive segmentation into image preprocessing through a large encoder and interactive inference via a lightweight decoder, ensuring efficient real-time performance. However, SAM faces sta… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Accepted to CVPR 2024

  26. arXiv:2405.17596  [pdf, other

    cs.CV

    GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane

    Authors: Yansong Qu, Shaohui Dai, Xinyang Li, Jianghang Lin, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

    Abstract: 3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussia… ▽ More

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Our project page is available at https://goi-hyperplane.github.io/

  27. arXiv:2405.11535  [pdf, ps, other

    cs.PL

    Proving Functional Program Equivalence via Directed Lemma Synthesis

    Authors: Yican Sun, Ruyi Ji, Jian Fang, Xuanlin Jiang, Mingshuai Chen, Yingfei Xiong

    Abstract: Proving equivalence between functional programs is a fundamental problem in program verification, which often amounts to reasoning about algebraic data types (ADTs) and compositions of structural recursions. Modern theorem provers address this problem by applying structural induction, which is insufficient for proving many equivalence theorems. In such cases, one has to invent a set of lemmas, pro… ▽ More

    Submitted 19 May, 2024; originally announced May 2024.

    Comments: 21 pages

  28. arXiv:2405.09874  [pdf, other

    cs.CV

    Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

    Authors: Xinyang Li, Zhangyu Lai, Linning Xu, Jianfei Guo, Liujuan Cao, Shengchuan Zhang, Bo Dai, Rongrong Ji

    Abstract: We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from texts in only $1$ minute.The key component is a dual-mode multi-view latent diffusion model. Given the noisy multi-view latents, the 2D mode can efficiently denoise them with a single latent denoising network, while the 3D mode can generate a tri-plane neural surface for consistent rendering-based… ▽ More

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: Project Page: https://dual3d.github.io

  29. arXiv:2405.09567  [pdf

    eess.SP cs.AI cs.LG

    ECG-SMART-NET: A Deep Learning Architecture for Precise ECG Diagnosis of Occlusion Myocardial Infarction

    Authors: Nathan T. Riek, Murat Akcakaya, Zeineb Bouzid, Tanmay Gokhale, Stephanie Helman, Karina Kraevsky-Philips, Rui Qi Ji, Ervin Sejdic, Jessica K. Zègre-Hemsey, Christian Martin-Gill, Clifton W. Callaway, Samir Saba, Salah Al-Zaiti

    Abstract: In this paper we describe ECG-SMART-NET for identification of occlusion myocardial infarction (OMI). OMI is a severe form of heart attack characterized by complete blockage of one or more coronary arteries requiring immediate referral for cardiac catheterization to restore blood flow to the heart. Two thirds of OMI cases are difficult to visually identify from a 12-lead electrocardiogram (ECG) and… ▽ More

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: 7 pages, 7 figures, 5 tables

  30. arXiv:2405.07478  [pdf, other

    eess.SY

    Coded Event-triggered Control for Nonlinear Systems

    Authors: Ruihang Ji, Shuzhi Sam Ge, Kai Zhao

    Abstract: This paper studies a Coded Event-triggered Control (CEC) for a class of nonlinear systems under any initial condition. To reduce communication burden, the CEC is designed from the encoding-decoding viewpoint by which only $m$-length string is transmitted for each communication between CEC and actuator. If a more general Entry Capture Problem is encountered, such control design will be rather compl… ▽ More

    Submitted 13 May, 2024; originally announced May 2024.

  31. arXiv:2405.05803  [pdf, other

    cs.CV cs.AI

    Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

    Authors: Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji

    Abstract: Multimodal large language models (MLLMs) demand considerable computations for inference due to the extensive parameters and the additional input tokens needed for visual information representation. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention s… ▽ More

    Submitted 9 May, 2024; originally announced May 2024.

  32. arXiv:2405.00954  [pdf, other

    cs.CV

    X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation

    Authors: Yiwei Ma, Zhekai Lin, Jiayi Ji, Yijun Fan, Xiaoshuai Sun, Rongrong Ji

    Abstract: Recent advancements in automatic 3D avatar generation guided by text have made significant progress. However, existing methods have limitations such as oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progressive framework for generating high-quality animatable avatars from text prompts. It follows a sequential Geometry->Texture->Animation paradigm, simplif… ▽ More

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: ICML2024

  33. arXiv:2405.00587  [pdf, other

    cs.CV

    GraCo: Granularity-Controllable Interactive Segmentation

    Authors: Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, Rongrong Ji, Chang Liu, Li Yuan, Jie Chen

    Abstract: Interactive Segmentation (IS) segments specific objects or parts in the image according to user input. Current IS pipelines fall into two categories: single-granularity output and multi-granularity output. The latter aims to alleviate the spatial ambiguity present in the former. However, the multi-granularity output pipeline suffers from limited interaction flexibility and produces redundant resul… ▽ More

    Submitted 16 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: CVPR2024 Highlight, Project: https://zhao-yian.github.io/GraCo

  34. arXiv:2404.17230  [pdf, other

    cs.CV

    ObjectAdd: Adding Objects into Image via a Training-Free Diffusion Modification Fashion

    Authors: Ziyue Zhang, Mingbao Lin, Rongrong Ji

    Abstract: We introduce ObjectAdd, a training-free diffusion modification method to add user-expected objects into user-specified area. The motive of ObjectAdd stems from: first, describing everything in one prompt can be difficult, and second, users often need to add objects into the generated image. To accommodate with real world, our ObjectAdd maintains accurate image consistency after adding objects with… ▽ More

    Submitted 2 May, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

    Comments: 12 pages

  35. arXiv:2404.16033  [pdf, other

    cs.CV cs.CL

    Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

    Authors: Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, Yan Zhang, Shengchuan Zhang, Xiawu Zheng, Xing Sun, Liujuan Cao, Rongrong Ji

    Abstract: With the advent of large language models(LLMs) enhanced by the chain-of-thought(CoT) methodology, visual reasoning problem is usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of the potential "determining hallucinations" in decision-making due to insufficient visual information and the limitation of low-… ▽ More

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: The project page is available at https://ggg0919.github.io/cantor/

  36. arXiv:2404.15141  [pdf, other

    cs.CV cs.AI

    CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method

    Authors: Mingbao Lin, Zhihang Lin, Wengyi Zhan, Liujuan Cao, Rongrong Ji

    Abstract: Transforming large pre-trained low-resolution diffusion models to cater to higher-resolution demands, i.e., diffusion extrapolation, significantly improves diffusion adaptability. We propose tuning-free CutDiffusion, aimed at simplifying and accelerating the diffusion extrapolation process, making it more affordable and improving performance. CutDiffusion abides by the existing patch-wise extrapol… ▽ More

    Submitted 23 April, 2024; originally announced April 2024.

  37. arXiv:2404.14949  [pdf, other

    cs.CV

    Multi-Modal Prompt Learning on Blind Image Quality Assessment

    Authors: Wensheng Pan, Timin Gao, Yan Zhang, Runze Hu, Xiawu Zheng, Enwei Zhang, Yuting Gao, Yutao Liu, Yunhang Shen, Ke Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji

    Abstract: Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Currently, leveraging semantic information to enhance IQA is a crucial research direction. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semant… ▽ More

    Submitted 18 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

  38. arXiv:2404.13921  [pdf, other

    cs.CV

    NeRF-DetS: Enhancing Multi-View 3D Object Detection with Sampling-adaptive Network of Continuous NeRF-based Representation

    Authors: Chi Huang, Xinyang Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji

    Abstract: As a preliminary work, NeRF-Det unifies the tasks of novel view synthesis and 3D perception, demonstrating that perceptual tasks can benefit from novel view synthesis methods like NeRF, significantly improving the performance of indoor multi-view 3D object detection. Using the geometry MLP of NeRF to direct the attention of detection head to crucial parts and incorporating self-supervised loss fro… ▽ More

    Submitted 22 April, 2024; originally announced April 2024.

  39. A New Multi-Picture Architecture for Learned Video Deinterlacing and Demosaicing with Parallel Deformable Convolution and Self-Attention Blocks

    Authors: Ronglei Ji, A. Murat Tekalp

    Abstract: Despite the fact real-world video deinterlacing and demosaicing are well-suited to supervised learning from synthetically degraded data because the degradation models are known and fixed, learned video deinterlacing and demosaicing have received much less attention compared to denoising and super-resolution tasks. We propose a new multi-picture architecture for video deinterlacing or demosaicing b… ▽ More

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: 13 pages, 6 figures, accepted to IMAVIS

  40. arXiv:2404.12903  [pdf, other

    cs.MM

    ConCLVD: Controllable Chinese Landscape Video Generation via Diffusion Model

    Authors: Dingming Liu, Shaowei Li, Ruoyan Zhou, Lili Liang, Yongguan Hong, Fei Chao, Rongrong Ji

    Abstract: Chinese landscape painting is a gem of Chinese cultural and artistic heritage that showcases the splendor of nature through the deep observations and imaginations of its painters. Limited by traditional techniques, these artworks were confined to static imagery in ancient times, leaving the dynamism of landscapes and the subtleties of artistic sentiment to the viewer's imagination. Recently, emerg… ▽ More

    Submitted 19 April, 2024; originally announced April 2024.

  41. arXiv:2404.12410  [pdf

    cond-mat.mtrl-sci

    Semitransparent perovskite solar cells with an evaporated ultra-thin perovskite absorber

    Authors: Zongbao Zhang, Ran Ji, Xiangkun Jia, Shu-Jen Wang, Marielle Deconinck, Elena Siliavka, Yana Vaynzof

    Abstract: Metal halide perovskites are of great interest for application in semitransparent solar cells due to their tunable bandgap and high performance. However, fabricating high-efficiency perovskite semitransparent devices with high average visible transmittance (AVT) is challenging because of their high absorption coefficient. Here, we adopt a co-evaporation process to fabricate ultrathin CsPbI3 perovs… ▽ More

    Submitted 17 April, 2024; originally announced April 2024.

  42. arXiv:2404.11636  [pdf

    cond-mat.mtrl-sci

    Towards low-temperature processing of efficient $γ$-CsPbI$_3$ perovskite solar cells

    Authors: Zongbao Zhang, Ran Ji, Yvonne J. Hofstetter, Marielle Deconinck, Julius Brunner, Yanxiu Li, Qingzhi An, Yana Vaynzof

    Abstract: Inorganic cesium lead iodide (CsPbI$_3$) perovskite solar cells (PSCs) have attracted enormous attention due to their excellent thermal stability and optical bandgap (~1.73 eV), well-suited for tandem device applications. However, achieving high-performing photovoltaic devices processed at low temperatures is still challenging. Here we reported a new method to fabricate high-efficiency and stable… ▽ More

    Submitted 17 April, 2024; originally announced April 2024.

  43. arXiv:2404.11264  [pdf

    cond-mat.mtrl-sci

    Perovskite Phase Heterojunction Solar Cells

    Authors: Ran Ji, Zongbao Zhang, Yvonne J. Hofstetter, Robin Buschbeck, Christian Hänisch, Fabian Paulus, Yana Vaynzof

    Abstract: Modern photovoltaic devices are often based on a heterojunction structure where two components with different optoelectronic properties are interfaced. The properties of each side of the junction can be tuned by either utilizing different materials (e.g. donor/acceptor) or doping (e.g. PN Si junction) or even varying their dimensionality (e.g. 3D/2D). In this work we demonstrate the concept of pha… ▽ More

    Submitted 17 April, 2024; originally announced April 2024.

  44. arXiv:2404.11064  [pdf, other

    cs.CV cs.AI

    Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

    Authors: Yongdong Luo, Haojia Lin, Xiawu Zheng, Yigeng Jiang, Fei Chao, Jie Hu, Guannan Jiang, Songan Zhang, Rongrong Ji

    Abstract: 3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, which require both shared and complementary information in localization and visual-language relationships. Therefore, existing approaches adopt the two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the performance of the detector, resulting in suboptimal perform… ▽ More

    Submitted 17 April, 2024; originally announced April 2024.

  45. arXiv:2404.01905  [pdf, other

    astro-ph.SR

    On the surface helium abundance of B-type hot subdwarf stars from the WD+MS channel of Type Ia supernovae

    Authors: Rui-Jie Ji, Xiang-Cun Meng, Zheng-Wei Liu

    Abstract: The origin of intermediate helium (He)-rich hot subdwarfs are still unclear. Previous studies have suggested that some surviving Type Ia supernovae (SNe Ia) companions from the white dwarf~+~main-sequence (WD+MS) channel may contribute to the intermediate He-rich hot subdwarfs. However, previous studies ignored the impact of atomic diffusion on the post-explosion evolution of surviving companion s… ▽ More

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 10 pages, 5 figures

  46. arXiv:2404.01342  [pdf, other

    cs.CL cs.AI

    DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

    Authors: Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

    Abstract: Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example, the Civitai community, a platform for T2I innovation, currently hosts an impressive array of 74,492 distinct models. However, this diversity presents a formidable challenge in selecting the most appropriate model and parameters, a process tha… ▽ More

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: Published as a conference paper at CVPR 2024

  47. arXiv:2404.00650  [pdf, other

    cs.CV

    Deep Instruction Tuning for Segment Anything Model

    Authors: Xiaorui Huang, Gen Luo, Chaoyang Zhu, Bo Tong, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji

    Abstract: Recently, Segment Anything Model (SAM) has become a research hotspot in the fields of multimedia and computer vision, which exhibits powerful yet versatile capabilities on various (un) conditional image segmentation tasks. Although SAM can support different types of segmentation prompts, we note that, compared to point- and box-guided segmentations, it performs much worse on text-instructed tasks,… ▽ More

    Submitted 27 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

  48. arXiv:2403.18471  [pdf, other

    cs.CV

    DiffusionFace: Towards a Comprehensive Dataset for Diffusion-Based Face Forgery Analysis

    Authors: Zhongxi Chen, Ke Sun, Ziyin Zhou, Xianming Lin, Xiaoshuai Sun, Liujuan Cao, Rongrong Ji

    Abstract: The rapid progress in deep learning has given rise to hyper-realistic facial forgery methods, leading to concerns related to misinformation and security risks. Existing face forgery datasets have limitations in generating high-quality facial images and addressing the challenges posed by evolving generative techniques. To combat this, we present DiffusionFace, the first diffusion-based face forgery… ▽ More

    Submitted 27 March, 2024; originally announced March 2024.

  49. arXiv:2403.16169  [pdf, other

    cs.CV

    Gaze-guided Hand-Object Interaction Synthesis: Benchmark and Method

    Authors: Jie Tian, Lingxiao Yang, Ran Ji, Yuexin Ma, Lan Xu, Jingyi Yu, Ye Shi, Jingya Wang

    Abstract: Gaze plays a crucial role in revealing human attention and intention, shedding light on the cognitive processes behind human actions. The integration of gaze guidance with the dynamics of hand-object interactions boosts the accuracy of human motion prediction. However, the lack of datasets that capture the intricate relationship and consistency among gaze, hand, and object movements remains a subs… ▽ More

    Submitted 28 March, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

  50. arXiv:2403.15226  [pdf, other

    cs.MM cs.CL

    Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models

    Authors: Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji

    Abstract: In this paper, we propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant to downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the… ▽ More

    Submitted 5 June, 2024; v1 submitted 22 March, 2024; originally announced March 2024.