Showing 1–50 of 77 results for author: Xie, E

  1. arXiv:2403.16996  [pdf, other]

    cs.CV cs.RO

    DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving

    Authors: Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, Ping Luo

    Abstract: End-to-end driving has made significant progress in recent years, demonstrating benefits such as system simplicity and competitive driving performance under both open-loop and closed-loop settings. Nevertheless, the lack of interpretability and controllability in its driving decisions hinders real-world deployment for end-to-end driving systems. In this paper, we collect a comprehensive end-to-end…

    Submitted 25 March, 2024; originally announced March 2024.

  2. arXiv:2403.13807  [pdf, other]

    cs.CV cs.LG

    Editing Massive Concepts in Text-to-Image Diffusion Models

    Authors: Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu

    Abstract: Text-to-image diffusion models suffer from the risk of generating outdated, copyrighted, incorrect, and biased content. While previous methods have mitigated the issues on a small scale, it is essential to handle them simultaneously in larger-scale real-world scenarios. We propose a two-stage method, Editing Massive Concepts In Diffusion Models (EMCID). The first stage performs memory optimization…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Project page: https://silentview.github.io/EMCID/ . Code: https://github.com/SilentView/EMCID

  3. arXiv:2403.10047  [pdf, other]

    cs.CV

    TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model

    Authors: Jiahao Lyu, Jin Wei, Gangyan Zeng, Zeng Li, Enze Xie, Wei Wang, Yu Zhou

    Abstract: Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precis…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: 12 pages, 8 figures

  4. arXiv:2403.04692  [pdf, other]

    cs.CV

    PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

    Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li

    Abstract: In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α,…

    Submitted 17 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Project Page: https://pixart-alpha.github.io/PixArt-sigma-project/

  5. arXiv:2402.17376  [pdf, other]

    cs.CV cs.AI cs.LG

    Accelerating Diffusion Sampling with Optimized Time Steps

    Authors: Shuchen Xue, Zhaoqiang Liu, Fei Chen, Shifeng Zhang, Tianyang Hu, Enze Xie, Zhenguo Li

    Abstract: Diffusion probabilistic models (DPMs) have shown remarkable performance in high-resolution image synthesis, but their sampling efficiency is still to be desired due to the typically large number of sampling steps. Recent advancements in high-order numerical ODE solvers for DPMs have enabled the generation of high-quality images with much fewer sampling steps. While this is a significant developmen…

    Submitted 3 July, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

    Comments: CVPR 2024

  6. arXiv:2402.13572  [pdf, other]

    cs.LG cs.AI math.NA

    On the Expressive Power of a Variant of the Looped Transformer

    Authors: Yihang Gao, Chuanyang Zheng, Enze Xie, Han Shi, Tianyang Hu, Yu Li, Michael K. Ng, Zhenguo Li, Zhaoqiang Liu

    Abstract: Besides natural language processing, transformers exhibit extraordinary performance in solving broader applications, including scientific computing and computer vision. Previous works try to explain this from the expressive power and capability perspectives that standard transformers are capable of performing some algorithms. To empower transformers with algorithmic capabilities and motivated by t…

    Submitted 21 February, 2024; originally announced February 2024.

  7. arXiv:2401.15688  [pdf, other]

    cs.CV

    Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

    Authors: Zhenyu Wang, Enze Xie, Aoxue Li, Zhongdao Wang, Xihui Liu, Zhenguo Li

    Abstract: Despite significant advancements in text-to-image models for generating high-quality images, these methods still struggle to ensure the controllability of text prompts over images in the context of complex text prompts, especially when it comes to retaining object attributes and relationships. In this paper, we propose CompAgent, a training-free approach for compositional text-to-image generation,…

    Submitted 30 January, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

  8. arXiv:2401.05252  [pdf, other]

    cs.CV

    PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models

    Authors: Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li

    Abstract: This technical report introduces PIXART-δ, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-α is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-δ significantly accelerates the inference speed…

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: Technical Report

  9. arXiv:2312.15856  [pdf, other]

    cs.GR cs.CV

    SERF: Fine-Grained Interactive 3D Segmentation and Editing with Radiance Fields

    Authors: Kaichen Zhou, Lanqing Hong, Enze Xie, Yongxin Yang, Zhenguo Li, Wei Zhang

    Abstract: Although significant progress has been made in the field of 2D-based interactive editing, fine-grained 3D-based interactive editing remains relatively unexplored. This limitation can be attributed to two main challenges: the lack of an efficient 3D representation robust to different modifications and the absence of an effective 3D interactive segmentation method. In this paper, we introduce a nove…

    Submitted 25 December, 2023; originally announced December 2023.

  10. arXiv:2312.11562  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    A Survey of Reasoning with Foundation Models

    Authors: Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, Yue Wu, Wenhai Wang, Junsong Chen, Zhangyue Yin, Xiaozhe Ren, Jie Fu, Junxian He, Wu Yuan, Qi Liu, Xihui Liu, Yu Li, Hao Dong, Yu Cheng, Ming Zhang, Pheng Ann Heng , et al. (9 additional authors not shown)

    Abstract: Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, e.g., Large Language Models (LLMs), there is a growing interest in exploring…

    Submitted 25 January, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

    Comments: 20 Figures, 160 Pages, 750+ References, Project Page https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models

  11. arXiv:2312.07231  [pdf, other]

    cs.CV cs.AI cs.LG

    Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation

    Authors: Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li

    Abstract: Diffusion Transformers have recently shown remarkable effectiveness in generating high-quality 3D point clouds. However, training voxel-based diffusion models for high-resolution 3D voxels remains prohibitively expensive due to the cubic complexity of attention operators, which arises from the additional dimension of voxels. Motivated by the inherent redundancy of 3D compared to 2D, we propose Fas…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Project Page: https://dit-3d.github.io/FastDiT-3D/

  12. arXiv:2312.02936  [pdf, other]

    cs.CV

    Drag-A-Video: Non-rigid Video Editing with Point-based Interaction

    Authors: Yao Teng, Enze Xie, Yue Wu, Haoyu Han, Zhenguo Li, Xihui Liu

    Abstract: Video editing is a challenging task that requires manipulating videos on both the spatial and temporal dimensions. Existing methods for video editing mainly focus on changing the appearance or style of the objects in the video, while keeping their structures unchanged. However, there is no existing method that allows users to interactively "drag" any points of instances on the first frame to pre…

    Submitted 5 December, 2023; originally announced December 2023.

  13. arXiv:2311.14603  [pdf, other]

    cs.CV

    Animate124: Animating One Image to 4D Dynamic Scene

    Authors: Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee

    Abstract: We introduce Animate124 (Animate-one-image-to-4D), the first work to animate a single in-the-wild image into 3D video through textual motion descriptions, an underexplored problem with significant applications. Our 4D generation leverages an advanced 4D grid dynamic Neural Radiance Field (NeRF) model, optimized in three distinct stages using multiple diffusion priors. Initially, a static model is…

    Submitted 18 February, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

    Comments: Project Page: https://animate124.github.io

  14. arXiv:2311.14580  [pdf, other]

    cs.CV

    Large Language Models as Automated Aligners for benchmarking Vision-Language Models

    Authors: Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo

    Abstract: With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing the alignment of th…

    Submitted 24 November, 2023; originally announced November 2023.

  15. arXiv:2311.01682  [pdf, other]

    cs.CV

    Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

    Authors: Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Ping Luo, Zaiqing Nie

    Abstract: Cooperatively utilizing both ego-vehicle and infrastructure sensor data can significantly enhance autonomous driving perception abilities. However, the uncertain temporal asynchrony and limited communication conditions can lead to fusion misalignment and constrain the exploitation of infrastructure data. To address these issues in vehicle-infrastructure cooperative 3D (VIC3D) object detection, we…

    Submitted 2 November, 2023; originally announced November 2023.

    Comments: Accepted by NeurIPS 2023. arXiv admin note: text overlap with arXiv:2303.10552

  16. arXiv:2310.02954  [pdf, other]

    cs.CL

    DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning

    Authors: Jing Xiong, Zixuan Li, Chuanyang Zheng, Zhijiang Guo, Yichun Yin, Enze Xie, Zhicheng Yang, Qingxing Cao, Haiming Wang, Xiongwei Han, Jing Tang, Chengming Li, Xiaodan Liang

    Abstract: Recent advances in natural language processing, primarily propelled by Large Language Models (LLMs), have showcased their remarkable capabilities grounded in in-context learning. A promising avenue for guiding LLMs in intricate reasoning tasks involves the utilization of intermediate reasoning steps within the Chain-of-Thought (CoT) paradigm. Nevertheless, the central challenge lies in the effecti…

    Submitted 2 March, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted in ICLR 2024

  17. arXiv:2310.02601  [pdf, other]

    cs.CV cs.AI

    MagicDrive: Street View Generation with Diverse 3D Geometry Control

    Authors: Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, Qiang Xu

    Abstract: Recent advancements in diffusion models have significantly enhanced the data synthesis with 2D control. Yet, precise 3D control in street view generation, crucial for 3D perception tasks, remains elusive. Specifically, utilizing Bird's-Eye View (BEV) as the primary condition often leads to challenges in geometry control (e.g., height), affecting the representation of object shapes, occlusion patte…

    Submitted 3 May, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Project Page: https://flymin.github.io/magicdrive; Figure 7 updated

  18. arXiv:2310.01412  [pdf, other]

    cs.CV cs.RO

    DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

    Authors: Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee. K. Wong, Zhenguo Li, Hengshuang Zhao

    Abstract: Multimodal large language models (MLLMs) have emerged as a prominent area of interest within the research community, given their proficiency in handling and reasoning with non-textual data, including images and videos. This study seeks to extend the application of MLLMs to the realm of autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on…

    Submitted 14 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: The project page is available at https://tonyxuqaq.github.io/projects/DriveGPT4/

  19. arXiv:2310.00656  [pdf, other]

    cs.AI

    LEGO-Prover: Neural Theorem Proving with Growing Libraries

    Authors: Haiming Wang, Huajian Xin, Chuanyang Zheng, Lin Li, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jian Yin, Zhenguo Li, Heng Liao, Xiaodan Liang

    Abstract: Despite the success of large language models (LLMs), the task of theorem proving still remains one of the hardest reasoning tasks that is far from being fully solved. Prior methods using language models have demonstrated promising results, but they still struggle to prove even middle school level theorems. One common limitation of these methods is that they assume a fixed theorem library during th…

    Submitted 27 October, 2023; v1 submitted 1 October, 2023; originally announced October 2023.

  20. arXiv:2310.00426  [pdf, other]

    cs.CV

    PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Authors: Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li

    Abstract: The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and eve…

    Submitted 29 December, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

    Comments: Project Page: https://pixart-alpha.github.io

  21. arXiv:2309.15806  [pdf, other]

    cs.CL cs.AI

    Lyra: Orchestrating Dual Correction in Automated Theorem Proving

    Authors: Chuanyang Zheng, Haiming Wang, Enze Xie, Zhengying Liu, Jiankai Sun, Huajian Xin, Jianhao Shen, Zhenguo Li, Yu Li

    Abstract: Large Language Models (LLMs) present an intriguing avenue for exploration in the field of formal theorem proving. Nevertheless, their full potential, particularly concerning the mitigation of hallucinations and refinement through prover error messages, remains an area that has yet to be thoroughly investigated. To enhance the effectiveness of LLMs in the field, we introduce the Lyra, a new framewo…

    Submitted 7 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Tech Report

  22. arXiv:2308.13853  [pdf, other]

    cs.CV

    Beyond One-to-One: Rethinking the Referring Image Segmentation

    Authors: Yutao Hu, Qixiong Wang, Wenqi Shao, Enze Xie, Zhenguo Li, Jungong Han, Ping Luo

    Abstract: Referring image segmentation aims to segment the target object referred by a natural language expression. However, previous methods rely on the strong assumption that one sentence must describe one target in the image, which is often not the case in real-world applications. As a result, such methods fail when the expressions refer to either no objects or multiple objects. In this paper, we address…

    Submitted 26 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  23. arXiv:2307.06350  [pdf, other]

    cs.CV

    T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation

    Authors: Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, Xihui Liu

    Abstract: Despite the stunning ability to generate high-quality images by recent text-to-image models, current approaches often struggle to effectively compose objects with different attributes and relationships into a complex and coherent scene. We propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional text prompts from 3 ca…

    Submitted 30 October, 2023; v1 submitted 12 July, 2023; originally announced July 2023.

    Comments: Project page: https://karine-h.github.io/T2I-CompBench/

  24. arXiv:2307.04106  [pdf, other]

    cs.CV

    Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View

    Authors: Jiayu Yang, Enze Xie, Miaomiao Liu, Jose M. Alvarez

    Abstract: Recent vision-only perception models for autonomous driving achieved promising results by encoding multi-view image features into Bird's-Eye-View (BEV) space. A critical step and the main bottleneck of these methods is transforming image features into the BEV coordinate frame. This paper focuses on leveraging geometry information, such as depth, to model such feature transformation. Existing works…

    Submitted 11 July, 2023; v1 submitted 9 July, 2023; originally announced July 2023.

  25. arXiv:2307.02159  [pdf, other]

    stat.ML cs.CV cs.LG math.AP

    DiffFlow: A Unified SDE Framework for Score-Based Diffusion Models and Generative Adversarial Networks

    Authors: Jingwei Zhang, Han Shi, Jincheng Yu, Enze Xie, Zhenguo Li

    Abstract: Generative models can be categorized into two types: explicit generative models that define explicit density forms and allow exact likelihood inference, such as score-based diffusion models (SDMs) and normalizing flows; implicit generative models that directly learn a transformation from the prior to the data distribution, such as generative adversarial nets (GANs). While these two types of models…

    Submitted 5 July, 2023; originally announced July 2023.

    Comments: Tech Report

  26. arXiv:2307.01831  [pdf, other]

    cs.CV cs.AI cs.LG

    DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation

    Authors: Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li

    Abstract: Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful effectiveness in generating high-quality 2D images. However, it is still being determined whether the Transformer architecture performs equally well in 3D shape generation, as previous 3D diffusion methods mostly adopted the U-Net architecture. To bridge this gap, we propose a novel Diffusion Transformer for 3D shape genera…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: Project Page: https://dit-3d.github.io/

  27. arXiv:2306.16329  [pdf, other]

    cs.CV

    DiffComplete: Diffusion-based Generative 3D Shape Completion

    Authors: Ruihang Chu, Enze Xie, Shentong Mo, Zhenguo Li, Matthias Nießner, Chi-Wing Fu, Jiaya Jia

    Abstract: We introduce a new diffusion-based approach for shape completion on 3D range scans. Compared with prior deterministic and probabilistic methods, we strike a balance between realism, multi-modality, and high fidelity. We propose DiffComplete by casting shape completion as a generative task conditioned on the incomplete shape. Our key designs are two-fold. First, we devise a hierarchical feature agg…

    Submitted 28 June, 2023; originally announced June 2023.

    Comments: Project Page: https://ruihangchu.com/diffcomplete.html

  28. arXiv:2306.04607  [pdf, other]

    cs.CV cs.AI

    GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation

    Authors: Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung

    Abstract: Diffusion models have attracted significant attention due to the remarkable ability to create content and generate data for tasks like image classification. However, the usage of diffusion models to generate the high-quality object detection data remains an underexplored area, where not only image-level perceptual quality but also geometric conditions such as bounding boxes and camera views are es…

    Submitted 16 February, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted by ICLR 2024. Project Page: https://kaichen1998.github.io/projects/geodiffusion/

  29. arXiv:2305.08850  [pdf, other]

    cs.CV

    Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts

    Authors: Yuyang Zhao, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee

    Abstract: The text-driven image and video diffusion models have achieved unprecedented success in generating realistic and diverse content. Recently, the editing and variation of existing images and videos in diffusion-based generative models have garnered significant attention. However, previous works are limited to editing content with text or providing coarse personalization using a single visual clue, r…

    Submitted 18 February, 2024; v1 submitted 15 May, 2023; originally announced May 2023.

    Comments: Project page: https://make-a-protagonist.github.io

  30. Periodicity Analysis of the Logistic Map over Ring $\mathbb{Z}_{3^n}$

    Authors: Xiaoxiong Lu, Eric Yong Xie, Chengqing Li

    Abstract: Periodicity analysis of sequences generated by a deterministic system is a long-standing challenge in both theoretical research and engineering applications. To overcome the inevitable degradation of the Logistic map on a finite-precision circuit, its numerical domain is commonly converted from a real number field to a ring or a finite field. This paper studies the period of sequences generated by…

    Submitted 13 April, 2023; originally announced April 2023.

    Comments: 10 pages

    MSC Class: 65P20

    Journal ref: International Journal of Bifurcation and Chaos, vol. 33, no. 5, art. no. 2350063, 2023

  31. arXiv:2304.09801  [pdf, other]

    cs.CV

    MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation

    Authors: Chongjian Ge, Junsong Chen, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, Ping Luo

    Abstract: Perception systems in modern autonomous driving vehicles typically take inputs from complementary multi-modal sensors, e.g., LiDAR and cameras. However, in real-world applications, sensor corruptions and failures lead to inferior performances, thus compromising autonomous safety. In this paper, we propose a robust framework, called MetaBEV, to address extreme real-world environments involving over…

    Submitted 19 April, 2023; originally announced April 2023.

    Comments: Project page: https://chongjiange.github.io/metabev.html

  32. arXiv:2304.09797  [pdf, other]

    cs.CL cs.LG

    Progressive-Hint Prompting Improves Reasoning in Large Language Models

    Authors: Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, Yu Li

    Abstract: The performance of Large Language Models (LLMs) in reasoning tasks depends heavily on prompt design, with Chain-of-Thought (CoT) and self-consistency being critical methods that enhance this ability. However, these methods do not fully exploit the answers generated by the LLM to guide subsequent responses. This paper proposes a new prompting method, named Progressive-Hint Prompting (PHP), that ena…

    Submitted 9 August, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

    Comments: Tech Report

  33. arXiv:2304.06648  [pdf, other]

    cs.CV

    DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning

    Authors: Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, Zhenguo Li

    Abstract: Diffusion models have proven to be highly effective in generating high-quality images. However, adapting large pre-trained diffusion models to new domains remains an open challenge, which is critical for real-world applications. This paper proposes DiffFit, a parameter-efficient strategy to fine-tune large pre-trained diffusion models that enable fast adaptation to new domains. DiffFit is embarras…

    Submitted 27 July, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: Tech Report

  34. arXiv:2304.01168  [pdf, other]

    cs.CV cs.LG cs.RO

    DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving

    Authors: Tianqi Wang, Sukmin Kim, Wenxuan Ji, Enze Xie, Chongjian Ge, Junsong Chen, Zhenguo Li, Ping Luo

    Abstract: Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports the direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident dataset includes…

    Submitted 17 December, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

  35. arXiv:2303.17559  [pdf, other]

    cs.CV

    DDP: Diffusion Model for Dense Visual Prediction

    Authors: Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo

    Abstract: We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipel…

    Submitted 13 May, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: Added controlnet exp

  36. arXiv:2303.10552  [pdf, other]

    cs.CV

    Vehicle-Infrastructure Cooperative 3D Object Detection via Feature Flow Prediction

    Authors: Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Jirui Yuan, Ping Luo, Zaiqing Nie

    Abstract: Cooperatively utilizing both ego-vehicle and infrastructure sensor data can significantly enhance autonomous driving perception abilities. However, temporal asynchrony and limited wireless communication in traffic environments can lead to fusion misalignment and impact detection performance. This paper proposes Feature Flow Net (FFNet), a novel cooperative detection framework that uses a feature f…

    Submitted 18 March, 2023; originally announced March 2023.

    Comments: Under Review

  37. arXiv:2301.12511  [pdf, other]

    cs.CV

    Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline

    Authors: Yangguang Li, Bin Huang, Zeren Chen, Yufeng Cui, Feng Liang, Mingzhu Shen, Fenggang Liu, Enze Xie, Lu Sheng, Wanli Ouyang, Jing Shao

    Abstract: Recently, perception task based on Bird's-Eye View (BEV) representation has drawn more and more attention, and BEV representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources to execute on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effectiv…

    Submitted 29 January, 2023; originally announced January 2023.

    Comments: submitted to TPAMI. arXiv admin note: substantial text overlap with arXiv:2301.07870

  38. arXiv:2301.07870  [pdf, other]

    cs.CV

    Fast-BEV: Towards Real-time On-vehicle Bird's-Eye View Perception

    Authors: Bin Huang, Yangguang Li, Enze Xie, Feng Liang, Luya Wang, Mingzhu Shen, Fenggang Liu, Tianqi Wang, Ping Luo, Jing Shao

    Abstract: Recently, the pure camera-based Bird's-Eye-View (BEV) perception removes expensive Lidar sensors, making it a feasible solution for economical autonomous driving. However, most existing BEV solutions either suffer from modest performance or require considerable resources to execute on-vehicle inference. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of perf…

    Submitted 18 January, 2023; originally announced January 2023.

    Comments: Accepted by NeurIPS2022_ML4AD on October 22, 2022

    Journal ref: NeurIPS2022_ML4AD

  39. arXiv:2209.05324  [pdf, other]

    cs.CV cs.LG cs.RO

    Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

    Authors: Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiangwei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong Jia, Si Liu, Jianping Shi, Dahua Lin, Yu Qiao

    Abstract: Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention both from industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations get more complex, integrating multi-source information from different sens…

    Submitted 27 September, 2023; v1 submitted 12 September, 2022; originally announced September 2022.

    Comments: https://github.com/OpenDriveLab/Birds-eye-view-Perception

  40. arXiv:2205.04683  [pdf, other]

    cs.CV

    UNITS: Unsupervised Intermediate Training Stage for Scene Text Detection

    Authors: Youhui Guo, Yu Zhou, Xugong Qin, Enze Xie, Weiping Wang

    Abstract: Recent scene text detection methods are almost all based on deep learning and are data-driven. Synthetic data is commonly adopted for pre-training due to expensive annotation cost. However, there are obvious domain discrepancies between synthetic data and real-world data. It may lead to sub-optimal performance to directly adopt the model initialized by synthetic data in the fine-tuning stage. In this pape…

    Submitted 10 May, 2022; originally announced May 2022.

    Comments: Accepted by ICME 2022

  41. arXiv:2204.12451  [pdf, other]

    cs.CV

    Understanding The Robustness in Vision Transformers

    Authors: Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng, Jose M. Alvarez

    Abstract: Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emerging visual gr…

    Submitted 8 November, 2022; v1 submitted 26 April, 2022; originally announced April 2022.

  42. arXiv:2204.05088  [pdf, other]

    cs.CV

    M$^2$BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

    Authors: Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, Jose M. Alvarez

    Abstract: In this paper, we propose M$^2$BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Birds Eye View (BEV) space with multi-camera image inputs. Unlike the majority of previous works which separately process detection and segmentation, M$^2$BEV infers both tasks with a unified model and improves efficiency. M$^2$BEV efficiently transforms multi-view 2D image…

    Submitted 19 April, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Tech Report

  43. arXiv:2204.01268  [pdf, other]

    cs.CV

    Improving Monocular Visual Odometry Using Learned Depth

    Authors: Libo Sun, Wei Yin, Enze Xie, Zhengrong Li, Changming Sun, Chunhua Shen

    Abstract: Monocular visual odometry (VO) is an important task in robotics and computer vision. Thus far, how to build accurate and robust monocular VO systems that can work well in diverse scenarios remains largely unsolved. In this paper, we propose a framework to exploit monocular depth estimation for improving VO. The core of our framework is a monocular depth estimation module with a strong generalizati…

    Submitted 4 April, 2022; originally announced April 2022.

  44. arXiv:2203.17270  [pdf, other]

    cs.CV

    BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

    Authors: Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, Jifeng Dai

    Abstract: 3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal in…

    Submitted 13 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to ECCV 2022

  45. arXiv:2203.08421  [pdf, other]

    cs.CV

    WegFormer: Transformers for Weakly Supervised Semantic Segmentation

    Authors: Chunmeng Liu, Enze Xie, Wenjia Wang, Wenhai Wang, Guangyao Li, Ping Luo

    Abstract: Although convolutional neural networks (CNNs) have achieved remarkable progress in weakly supervised semantic segmentation (WSSS), the effective receptive field of CNNs is insufficient to capture global context information, leading to sub-optimal results. Inspired by the great success of Transformers in fundamental vision areas, this work for the first time introduces Transformer to build a simple…

    Submitted 16 March, 2022; originally announced March 2022.

    Comments: Tech Report

  46. arXiv:2111.02394  [pdf, other]

    cs.CV

    FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation

    Authors: Zhe Chen, Jiahao Wang, Wenhai Wang, Guo Chen, Enze Xie, Ping Luo, Tong Lu

    Abstract: We propose an accurate and efficient scene text detection framework, termed FAST (i.e., faster arbitrarily-shaped text detector). Unlike recent advanced text detectors, whose complicated post-processing and hand-crafted network architectures result in low inference speed, FAST has two new designs. (1) We design a minimalist kernel representation (with only a 1-channel output) to model…

    Submitted 11 January, 2023; v1 submitted 3 November, 2021; originally announced November 2021.

  47. arXiv:2109.03814  [pdf, other]

    cs.CV

    Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers

    Authors: Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo, Tong Lu

    Abstract: Panoptic segmentation involves a combination of joint semantic segmentation and instance segmentation, where image contents are divided into two types: things and stuff. We present Panoptic SegFormer, a general framework for panoptic segmentation with transformers. It contains three innovative components: an efficient deeply-supervised mask decoder, a query decoupling strategy, and an improved pos…

    Submitted 18 March, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

    Comments: Accepted to CVPR 2022

  48. arXiv:2107.10224  [pdf, other]

    cs.CV

    CycleMLP: A MLP-like Architecture for Dense Prediction

    Authors: Shoufa Chen, Enze Xie, Chongjian Ge, Runjian Chen, Ding Liang, Ping Luo

    Abstract: This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions. Compared to modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, whose architectures are tied to image size and are thus infeasible for object detection and segmentation, CycleMLP has two advantages. (1) It can cope…

    Submitted 18 March, 2022; v1 submitted 21 July, 2021; originally announced July 2021.

    Comments: ICLR 2022 (Oral). Camera-ready Code: https://github.com/ShoufaChen/CycleMLP

  49. PVT v2: Improved Baselines with Pyramid Vision Transformer

    Authors: Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

    Abstract: Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of…

    Submitted 17 April, 2023; v1 submitted 25 June, 2021; originally announced June 2021.

    Comments: Accepted to CVMJ 2022

    Journal ref: Computational Visual Media, 2022, Vol. 8, No. 3, Pages: 415-424

  50. arXiv:2105.15203  [pdf, other]

    cs.CV cs.LG

    SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

    Authors: Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo

    Abstract: We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of posit…

    Submitted 28 October, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

    Comments: Accepted by NeurIPS 2021