Showing 1–50 of 152 results for author: Ding, E

  1. arXiv:2407.11335  [pdf, other]

    cs.CV

    LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

    Authors: Penghui Du, Yu Wang, Yifan Sun, Luting Wang, Yue Liao, Gang Zhang, Errui Ding, Yan Wang, Jingdong Wang, Si Liu

    Abstract: Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs), such as CLIP. However, two main challenges emerge: (1) A deficiency in concept representation, where the category names in CLIP's text space lack textual and visual knowledge. (2) An overfitting tendency towards base categories, with the open vo…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: ECCV2024

  2. arXiv:2407.10753  [pdf, other]

    cs.CV

    OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

    Authors: Jinghua Hou, Tong Wang, Xiaoqing Ye, Zhe Liu, Shi Gong, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai

    Abstract: Accurate depth information is crucial for enhancing the performance of multi-view 3D object detection. Despite the success of some existing multi-view 3D detectors utilizing pixel-wise depth supervision, they overlook two significant phenomena: 1) the depth supervision obtained from LiDAR points is usually distributed on the surface of the object, which is not so friendly to existing DETR-based 3D…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  3. arXiv:2407.10655  [pdf, other]

    cs.CV

    OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

    Authors: Yu Wang, Xiangbo Su, Qiang Chen, Xinyu Zhang, Teng Xi, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang

    Abstract: Open-vocabulary object detection focuses on detecting novel categories guided by natural language. In this report, we propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency. Building upon OVLW-DETR, we provide an end-to-end training recipe that transfers knowledge from vision-language mode…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 4 pages

  4. arXiv:2406.18360  [pdf, other]

    cs.CV

    XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis

    Authors: Hao Li, Ming Yuan, Yan Zhang, Chenming Wu, Chen Zhao, Chunyu Song, Haocheng Feng, Errui Ding, Dingwen Zhang, Jingdong Wang

    Abstract: Thoroughly testing autonomy systems is crucial in the pursuit of safe autonomous driving vehicles. It necessitates creating safety-critical scenarios that go beyond what can be safely collected from real-world data, as many of these scenarios occur infrequently on public roads. However, the evaluation of most existing NVS methods relies on sporadic sampling of image frames from the training data,…

    Submitted 26 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: project page: https://3d-aigc.github.io/XLD/

  5. arXiv:2406.18198  [pdf, other]

    cs.CV

    VDG: Vision-Only Dynamic Gaussian for Driving Simulation

    Authors: Hao Li, Jingfeng Li, Dingwen Zhang, Chenming Wu, Jieqi Shi, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han

    Abstract: Dynamic Gaussian splatting has led to impressive scene reconstruction and image synthesis advances in novel views. Existing methods, however, heavily rely on pre-computed poses and Gaussian initialization by Structure from Motion (SfM) algorithms or expensive sensors. For the first time, this paper addresses this issue by integrating self-supervised VO into our pose-free dynamic Gaussian method (V…

    Submitted 26 June, 2024; originally announced June 2024.

  6. arXiv:2406.08814  [pdf, other]

    cs.CV

    Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

    Authors: Zhengqi Zhao, Xiaohu Huang, Hao Zhou, Kun Yao, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng

    Abstract: The key to action counting is accurately locating each video's repetitive actions. Instead of estimating the probability of each frame belonging to an action directly, we propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner. The model draws inspiration from empirical observations indicating that humans typically engage in coarse skimming of entire sequences to grasp the…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 13 pages, 9 figures

  7. arXiv:2406.03459  [pdf, other]

    cs.CV

    LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

    Authors: Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang

    Abstract: In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for r…

    Submitted 5 June, 2024; originally announced June 2024.

  8. arXiv:2406.02058  [pdf, other]

    cs.CV cs.RO

    OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding

    Authors: Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Jian Zhang

    Abstract: This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding. Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing. These methods struggle with 3D point-level tasks due to weak feature expressiveness and inaccurate 2D-3D feature associations.…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: technical report, 15 pages

  9. arXiv:2405.21013  [pdf, other]

    cs.CV

    StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond

    Authors: Pengyuan Lyu, Yulin Li, Hao Zhou, Weihong Ma, Xingyu Wan, Qunyi Xie, Liang Wu, Chengquan Zhang, Kun Yao, Errui Ding, Jingdong Wang

    Abstract: Text-rich images have significant and extensive value, deeply integrated into various aspects of human life. Notably, both visual cues and linguistic symbols in text-rich images play crucial roles in information transmission but are accompanied by diverse challenges. Therefore, the efficient and effective understanding of text-rich images is a crucial litmus test for the capability of Vision-Langu…

    Submitted 4 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

  10. arXiv:2405.19765  [pdf, other]

    cs.CV cs.AI

    Towards Unified Multi-granularity Text Detection with Interactive Attention

    Authors: Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

    Abstract: Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce "Detect Any Text" (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: ICML 2024

  11. arXiv:2403.17387  [pdf, other]

    cs.CV

    Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection

    Authors: Jiacheng Zhang, Jiaming Li, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li

    Abstract: We delve into pseudo-labeling for semi-supervised monocular 3D object detection (SSM3OD) and discover two primary issues: a misalignment between the prediction quality of 3D and 2D attributes and the tendency of depth supervision derived from pseudo-labels to be noisy, leading to significant optimization conflicts with other reliable forms of supervision. We introduce a novel decoupled pseudo-labe…

    Submitted 23 April, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: To appear in CVPR2024

  12. arXiv:2403.15127  [pdf, other]

    cs.CV

    Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection

    Authors: Jiaming Li, Xiangru Lin, Wei Zhang, Xiao Tan, Yingying Li, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li

    Abstract: Current semi-supervised object detection (SSOD) algorithms typically assume class-balanced datasets (PASCAL VOC, etc.) or slightly class-imbalanced datasets (MS-COCO, etc.). This assumption can be easily violated since real-world datasets can be extremely class-imbalanced in nature, thus making the performance of semi-supervised object detectors far from satisfactory. Besides, the research for this…

    Submitted 22 March, 2024; originally announced March 2024.

    Comments: Accepted by ICCV2023

  13. arXiv:2403.15009  [pdf, other]

    cs.CV

    TexRO: Generating Delicate Textures of 3D Models by Recursive Optimization

    Authors: Jinbo Wu, Xing Liu, Chenming Wu, Xiaobo Gao, Jialun Liu, Xinqi Liu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang

    Abstract: This paper presents TexRO, a novel method for generating delicate textures of a known 3D mesh by optimizing its UV texture. The key contributions are two-fold. We propose an optimal viewpoint selection strategy that finds the smallest set of viewpoints covering all the faces of a mesh. Our viewpoint selection strategy guarantees the completeness of a generated result. We propose a recursive…

    Submitted 22 March, 2024; originally announced March 2024.

    Comments: Technical report. Project page: https://3d-aigc.github.io/TexRO

  14. arXiv:2403.10147  [pdf, other]

    cs.CV

    GGRt: Towards Pose-free Generalizable 3D Gaussian Splatting in Real-time

    Authors: Hao Li, Yuanyuan Gao, Chenming Wu, Dingwen Zhang, Yalun Dai, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han

    Abstract: This paper presents GGRt, a novel approach to generalizable novel view synthesis that alleviates the need for real camera poses, complexity in processing high-resolution images, and lengthy optimization processes, thus facilitating stronger applicability of 3D Gaussian Splatting (3D-GS) in real-world scenarios. Specifically, we design a novel joint learning framework that consists of an Iterative…

    Submitted 18 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

    Comments: Project page: https://3d-aigc.github.io/GGRt

  15. arXiv:2402.17726  [pdf, other]

    cs.CV

    VRP-SAM: SAM with Visual Reference Prompt

    Authors: Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li

    Abstract: In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM can utilize annotated reference images to comprehend specific objects and perform segmentation of specific objects in the target image. Note that the VRP encoder ca…

    Submitted 30 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

    Comments: Accepted by CVPR 2024; The camera-ready version

  16. arXiv:2402.16607  [pdf, other]

    cs.CV

    GVA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos

    Authors: Xinqi Liu, Chenming Wu, Jialun Liu, Xing Liu, Jinbo Wu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang

    Abstract: In this paper, we present a novel method that facilitates the creation of vivid 3D Gaussian avatars from monocular video inputs (GVA). Our innovation lies in addressing the intricate challenges of delivering high-fidelity human body reconstructions and aligning 3D Gaussians with human skin surfaces accurately. The key contributions of this paper are twofold. Firstly, we introduce a pose refinement…

    Submitted 19 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  17. arXiv:2401.03989  [pdf, other]

    cs.CV

    MS-DETR: Efficient DETR Training with Mixed Supervision

    Authors: Chuyang Zhao, Yifan Sun, Wenhao Wang, Qiang Chen, Errui Ding, Yi Yang, Jingdong Wang

    Abstract: DETR accomplishes end-to-end object detection through iteratively generating multiple object candidates based on image features and promoting one candidate for each ground-truth object. The traditional training procedure using one-to-one supervision in the original DETR lacks direct supervision for the object detection candidates. We aim at improving the DETR training efficiency by explicitly su…

    Submitted 8 January, 2024; originally announced January 2024.

  18. arXiv:2312.05133  [pdf, other]

    cs.CV

    GIR: 3D Gaussian Inverse Rendering for Relightable Scene Factorization

    Authors: Yahao Shi, Yanmin Wu, Chenming Wu, Xing Liu, Chen Zhao, Haocheng Feng, Jingtuo Liu, Liangjun Zhang, Jian Zhang, Bin Zhou, Errui Ding, Jingdong Wang

    Abstract: This paper presents GIR, a 3D Gaussian Inverse Rendering method for relightable scene factorization. Compared to existing methods leveraging discrete meshes or neural implicit fields for inverse rendering, our method utilizes 3D Gaussians to estimate the material properties, illumination, and geometry of an object from multi-view images. Our study is motivated by the evidence showing that 3D Gauss…

    Submitted 8 December, 2023; originally announced December 2023.

    Comments: technical report

  19. arXiv:2311.18435  [pdf, other]

    cs.CV

    Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

    Authors: Zipeng Qi, Guoxi Huang, Zebin Huang, Qin Guo, Jinwen Chen, Junyu Han, Jian Wang, Gang Zhang, Lufei Liu, Errui Ding, Jingdong Wang

    Abstract: This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We present two key innovations: Vision Guidance and the Layered Rendering Diffusion (LRDiff) framework. Vision Guidance, a spatial layout condition, acts as a clue in the perturbed distribution, greatly narrowing down the search space, to focus on the image sampling process ad…

    Submitted 30 November, 2023; originally announced November 2023.

  20. arXiv:2310.20695  [pdf, other]

    cs.CV cs.AI

    HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

    Authors: Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding, Lanfen Lin, Fei Wu, Jingdong Wang

    Abstract: Model pre-training is essential in human-centric perception. In this paper, we first introduce masked image modeling (MIM) as a pre-training approach for this task. Upon revisiting the MIM training strategy, we reveal that human structure priors offer significant potential. Motivated by this insight, we further incorporate an intuitive human structure prior - human parts - into pre-training. Speci…

    Submitted 31 October, 2023; originally announced October 2023.

    Comments: Accepted by NeurIPS 2023

  21. arXiv:2310.07664  [pdf, other]

    cs.CV

    Accelerating Vision Transformers Based on Heterogeneous Attention Patterns

    Authors: Deli Yu, Teng Xi, Jianwei Li, Baopu Li, Gang Zhang, Haocheng Feng, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

    Abstract: Recently, Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision. Generally, the powerful representative capacity of ViTs mainly benefits from the self-attention mechanism, which has a high computational complexity. To accelerate ViTs, we propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers. On one hand, dif…

    Submitted 11 October, 2023; originally announced October 2023.

  22. Effects of Ionic Strength on the Morphology, Scattering, and Mechanical Response of Neurofilament-Derived Protein Brushes

    Authors: Takashi J. Yokokura, Chao Duan, Erika A. Ding, Sanjay Kumar, Rui Wang

    Abstract: Protein brushes not only play a key role in the functionality of neurofilaments but also have wide applications in biomedical materials. Here, we investigate the effect of ionic strength on the morphology of protein brushes using a continuous-space self-consistent field theory. A coarse-grained multi-block charged macromolecular model is developed to capture the chemical identity of amino acid seq…

    Submitted 27 December, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

  23. arXiv:2309.17390  [pdf, other]

    cs.CV

    Forward Flow for Novel View Synthesis of Dynamic Scenes

    Authors: Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, Jingdong Wang

    Abstract: This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes using forward warping. Existing methods often adopt a static NeRF to represent the canonical space, and render dynamic images at other time steps by mapping the sampled 3D points back to the canonical space with the learned backward flow field. However, this backward flow field is non-smooth and…

    Submitted 29 September, 2023; originally announced September 2023.

    Comments: Accepted by ICCV2023 as oral. Project page: https://npucvr.github.io/ForwardFlowDNeRF

    Journal ref: ICCV2023

  24. arXiv:2309.00398  [pdf, other]

    cs.CV cs.MM

    VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

    Authors: Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang

    Abstract: In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide vi…

    Submitted 7 September, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

    Comments: 8 pages, 8 figures, project page: https://videogen.github.io/VideoGen/

  25. arXiv:2308.07313  [pdf, other]

    cs.CV

    Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

    Authors: Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang

    Abstract: In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR. We present a simple yet effective t…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted by ICCV 2023

  26. arXiv:2307.16183  [pdf, other]

    cs.CV

    HD-Fusion: Detailed Text-to-3D Generation Leveraging Multiple Noise Estimation

    Authors: Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, Errui Ding

    Abstract: In this paper, we study Text-to-3D content generation leveraging 2D diffusion priors to enhance the quality and detail of the generated 3D models. Recent progress (Magic3D) in text-to-3D has shown that employing high-resolution (e.g., 512 x 512) renderings can lead to the production of high-quality 3D models using latent diffusion priors. To enable rendering at even higher resolutions, which has t…

    Submitted 30 July, 2023; originally announced July 2023.

  27. arXiv:2307.08095  [pdf, other]

    cs.CV

    Semi-DETR: Semi-Supervised Object Detection with Detection Transformers

    Authors: Jiacheng Zhang, Xiangru Lin, Wei Zhang, Kuo Wang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li

    Abstract: We analyze the DETR-based framework on semi-supervised object detection (SSOD) and observe that (1) the one-to-one assignment strategy generates incorrect matching when the pseudo ground-truth bounding box is inaccurate, leading to training inefficiency; (2) DETR-based detectors lack deterministic correspondence between the input query and its prediction output, which hinders the applicability of…

    Submitted 16 July, 2023; originally announced July 2023.

    Comments: CVPR2023

  28. arXiv:2306.17074  [pdf, other]

    cs.CV cs.AI

    Learning Structure-Guided Diffusion Model for 2D Human Pose Estimation

    Authors: Zhongwei Qiu, Qiansheng Yang, Jian Wang, Xiyu Wang, Chang Xu, Dongmei Fu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: One of the mainstream schemes for 2D human pose estimation (HPE) is learning keypoints heatmaps by a neural network. Existing methods typically improve the quality of heatmaps by customized architectures, such as high-resolution representation and vision Transformers. In this paper, we propose DiffusionPose, a new scheme that formulates 2D HPE as a keypoints heatmaps generation problem fr…

    Submitted 29 June, 2023; originally announced June 2023.

  29. arXiv:2306.03287  [pdf, other]

    cs.CV

    ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

    Authors: Wenwen Yu, Chengquan Zhang, Haoyu Cao, Wei Hua, Bohan Li, Huang Chen, Mingyu Liu, Mingrui Chen, Jianfeng Kuang, Mengjun Cheng, Yuning Du, Shikun Feng, Xiaoguang Hu, Pengyuan Lyu, Kun Yao, Yuechen Yu, Yuliang Liu, Wanxiang Che, Errui Ding, Cheng-Lin Liu, Jiebo Luo, Shuicheng Yan, Min Zhang, Dimosthenis Karatzas, Xing Sun, et al. (2 additional authors not shown)

    Abstract: Structured text extraction is one of the most valuable and challenging application directions in the field of Document AI. However, the scenarios of past benchmarks are limited, and the corresponding evaluation protocols usually focus on the submodules of the structured text extraction scheme. In order to eliminate these problems, we organized the ICDAR 2023 competition on Structured text extracti…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: ICDAR 2023 Competition on SVRD report (to appear in ICDAR 2023)

  30. arXiv:2305.12881  [pdf, other]

    cs.CV cs.MM

    Building an Invisible Shield for Your Portrait against Deepfakes

    Authors: Jiazhi Guan, Tianshu Hu, Hang Zhou, Zhizhi Guo, Lirui Deng, Chengbin Quan, Errui Ding, Youjian Zhao

    Abstract: The issue of detecting deepfakes has garnered significant attention in the research community, with the goal of identifying facial manipulations for abuse prevention. Although recent studies have focused on developing generalized models that can detect various types of deepfakes, their performance is not always reliable and stable, which poses limitations in real-world applications. Instead of…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: under review

  31. arXiv:2305.07713  [pdf, other]

    cs.CV

    Multi-Modal 3D Object Detection by Box Matching

    Authors: Zhe Liu, Xiaoqing Ye, Zhikang Zou, Xinwei He, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai

    Abstract: Multi-modal 3D object detection has received growing attention as the information from different sensors like LiDAR and cameras is complementary. Most fusion methods for 3D detection rely on an accurate alignment and calibration between 3D point clouds and RGB images. However, such an assumption is not reliable in a real-world self-driving system, as the alignment between different modalities is…

    Submitted 12 May, 2023; originally announced May 2023.

  32. arXiv:2305.05445  [pdf, other]

    cs.CV cs.GR cs.MM

    StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator

    Authors: Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang, Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, Jingdong Wang

    Abstract: Despite recent advances in syncing lip movements with any audio waves, current methods still struggle to balance generation quality and the model's generalization ability. Previous studies either require long-term data for training or produce a similar movement pattern on all subjects with low quality. In this paper, we propose StyleSync, an effective framework that enables high-fidelity lip synch…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. Project page: https://hangz-nju-cuhk.github.io/projects/StyleSync

  33. arXiv:2303.15334  [pdf, other]

    cs.CV

    ByteTrackV2: 2D and 3D Multi-Object Tracking by Associating Every Detection Box

    Authors: Yifu Zhang, Xinggang Wang, Xiaoqing Ye, Wei Zhang, Jincheng Lu, Xiao Tan, Errui Ding, Peize Sun, Jingdong Wang

    Abstract: Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects across video frames. Detection boxes serve as the basis of both 2D and 3D MOT. The inevitable changing of detection scores leads to object missing after tracking. We propose a hierarchical data association strategy to mine the true objects in low-score detection boxes, which alleviates the problems of object mi…

    Submitted 27 March, 2023; originally announced March 2023.

    Comments: Code is available at https://github.com/ifzhang/ByteTrack-V2. arXiv admin note: text overlap with arXiv:2110.06864; substantial text overlap with arXiv:2203.06424 by other authors

  34. arXiv:2303.14960  [pdf, other]

    cs.CV

    Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection

    Authors: Chang Liu, Weiming Zhang, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Xiaomao Li, Errui Ding, Jingdong Wang

    Abstract: With basic Semi-Supervised Object Detection (SSOD) techniques, one-stage detectors generally obtain limited promotions compared with two-stage clusters. We experimentally find that the root lies in two kinds of ambiguities: (1) Selection ambiguity that selected pseudo labels are less accurate, since classification scores cannot properly represent the localization quality. (2) Assignment ambiguity…

    Submitted 27 March, 2023; originally announced March 2023.

    Comments: Accepted to CVPR 2023

  35. arXiv:2303.10209  [pdf, other]

    cs.CV

    CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

    Authors: Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding, Jingdong Wang, Xiang Bai

    Abstract: In this paper, we address the problem of detecting 3D objects from multi-view images. Current query-based methods rely on global 3D position embeddings (PE) to learn the geometric correspondence between images and 3D space. We claim that directly interacting 2D image features with global 3D PE could increase the difficulty of learning view transformation due to the variation of camera extrinsics.…

    Submitted 17 March, 2023; originally announced March 2023.

    Comments: Accepted by CVPR2023. Code is available

  36. arXiv:2303.09187  [pdf, other]

    cs.CV

    PSVT: End-to-End Multi-person 3D Pose and Shape Estimation with Progressive Video Transformers

    Authors: Zhongwei Qiu, Yang Qiansheng, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Chang Xu, Dongmei Fu, Jingdong Wang

    Abstract: Existing methods of multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy, which first detects human instances in each frame and then performs single-person PSE with a temporal model. However, the global spatio-temporal context among spatial instances cannot be captured. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation f…

    Submitted 16 March, 2023; originally announced March 2023.

    Comments: CVPR2023

  37. arXiv:2303.04970  [pdf, other]

    cs.CV

    LMR: A Large-Scale Multi-Reference Dataset for Reference-based Super-Resolution

    Authors: Lin Zhang, Xin Li, Dongliang He, Errui Ding, Zhaoxiang Zhang

    Abstract: It is widely agreed that reference-based super-resolution (RefSR) achieves superior results by referring to similar high quality images, compared to single image super-resolution (SISR). Intuitively, the more references, the better performance. However, previous RefSR methods have all focused on single-reference image training, while multiple reference images are often available in testing or prac…

    Submitted 8 March, 2023; originally announced March 2023.

    Comments: 6 figures, 10 pages

  38. arXiv:2303.02091  [pdf, other]

    cs.CV

    Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement

    Authors: Jiaxiang Tang, Hang Zhou, Xiaokang Chen, Tianshu Hu, Errui Ding, Jingdong Wang, Gang Zeng

    Abstract: Neural Radiance Fields (NeRF) have constituted a remarkable breakthrough in image-based 3D reconstruction. However, their implicit volumetric representations differ significantly from the widely-adopted polygonal meshes and lack support from common 3D software and hardware, making their rendering and manipulation inefficient. To overcome this limitation, we present a novel framework that generates…

    Submitted 19 August, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

    Comments: ICCV 2023 camera-ready, Project Page: https://me.kiui.moe/nerf2mesh

  39. arXiv:2303.00289  [pdf, other]

    cs.CV

    StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

    Authors: Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling, based on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. T…

    Submitted 1 March, 2023; originally announced March 2023.

    Comments: ICLR 2023

  40. arXiv:2302.13074  [pdf, other]

    cs.CV

    Temporal Segment Transformer for Action Segmentation

    Authors: Zhichao Liu, Leshan Wang, Desen Zhou, Jian Wang, Songyang Zhang, Yang Bai, Errui Ding, Rui Fan

    Abstract: Recognizing human actions from untrimmed videos is an important task in activity understanding, and poses unique challenges in modeling long-range temporal relations. Recent works adopt a predict-and-refine strategy which converts an initial prediction to action segments for global context modeling. However, the generated segment representations are often noisy and exhibit inaccurate segment bound…

    Submitted 25 February, 2023; originally announced February 2023.

  41. arXiv:2301.10900  [pdf, other]

    cs.CV

    Graph Contrastive Learning for Skeleton-based Action Recognition

    Authors: Xiaohu Huang, Hao Zhou, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng

    Abstract: In the field of skeleton-based action recognition, current top-performing graph convolutional networks (GCNs) exploit intra-sequence context to construct adaptive graphs for feature aggregation. However, we argue that such context is still local since the rich cross-sequence relations have not been explicitly investigated. In this paper, we propose a graph contrastive learning framework f…

    Submitted 10 June, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

    Comments: Accepted by ICLR2023

  42. arXiv:2301.01615  [pdf, other]

    cs.CV

    StereoDistill: Pick the Cream from LiDAR for Distilling Stereo-based 3D Object Detection

    Authors: Zhe Liu, Xiaoqing Ye, Xiao Tan, Errui Ding, Xiang Bai

    Abstract: In this paper, we propose a cross-modal distillation method named StereoDistill to narrow the gap between the stereo and LiDAR-based approaches via distilling the stereo detectors from the superior LiDAR model at the response level, which is usually overlooked in 3D object detection distillation. The key designs of StereoDistill are: the X-component Guided Distillation (XGD) for regression and the…

    Submitted 7 January, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

    Comments: Accepted by AAAI-2023

  43. arXiv:2212.04970  [pdf, other]

    cs.CV cs.AI cs.GR

    Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

    Authors: Yasheng Sun, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Zhibin Hong, Jingtuo Liu, Errui Ding, Jingdong Wang, Ziwei Liu, Hideki Koike

    Abstract: Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to non-realistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly…

    Submitted 9 December, 2022; originally announced December 2022.

    Comments: Accepted to SIGGRAPH Asia 2022 (Conference Proceedings). Project page: https://hangz-nju-cuhk.github.io/projects/AV-CAT

  44. arXiv:2212.03651  [pdf, other]

    cs.CV

    Cyclically Disentangled Feature Translation for Face Anti-spoofing

    Authors: Haixiao Yue, Keyao Wang, Guosheng Zhang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: Current domain adaptation methods for face anti-spoofing leverage labeled source domain data and unlabeled target domain data to obtain a promising generalizable decision boundary. However, it is usually difficult for these methods to achieve a perfect domain-invariant liveness feature disentanglement, which may degrade the final classification performance by domain differences in illumination, fa…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: Accepted by AAAI2023

  45. arXiv:2211.09799  [pdf, other]

    cs.CV

    CAE v2: Context Autoencoder with CLIP Target

    Authors: Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches. Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM. However, it is still under-explored how CLIP supervision in MIM influences performance. To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., t…

    Submitted 17 November, 2022; originally announced November 2022.

  46. arXiv:2211.08071  [pdf, other]

    cs.CV

    Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling

    Authors: Yu Wang, Xin Li, Shengzhao Wen, Fukui Yang, Wanping Zhang, Gang Zhang, Haocheng Feng, Junyu Han, Errui Ding

    Abstract: DETR is a novel end-to-end transformer architecture object detector, which significantly outperforms classic detectors when scaling up the model size. In this paper, we focus on the compression of DETR with knowledge distillation. While knowledge distillation has been well-studied in classic detectors, there is a lack of research on how to make it work effectively on DETR. We first provide exper…

    Submitted 15 November, 2022; v1 submitted 15 November, 2022; originally announced November 2022.

  47. arXiv:2211.03594  [pdf, ps, other]

    cs.CV

    Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

    Authors: Qiang Chen, Jian Wang, Chuchu Han, Shan Zhang, Zexian Li, Xiaokang Chen, Jiahui Chen, Xiaodi Wang, Shuming Han, Gang Zhang, Haocheng Feng, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder ViT-Huge~\cite{dosovitskiy2020image}, a DETR variant DINO~\cite{zhang2022dino}, and an efficient DETR training method Group DETR~\cite{chen2022group}. The training process consists of self-supervised pretraining and finetuning a ViT-Huge…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: Tech report, 3 pages. We establish a new SoTA (64.5 mAP) on the COCO test-dev

  48. arXiv:2210.07140  [pdf, other]

    cs.CV

    U-HRNet: Delving into Improving Semantic Representation of High Resolution Network for Dense Prediction

    Authors: Jian Wang, Xiang Long, Guowei Chen, Zewu Wu, Zeyu Chen, Errui Ding

    Abstract: High resolution and advanced semantic representation are both vital for dense prediction. Empirically, low-resolution feature maps often achieve stronger semantic representation, and high-resolution feature maps generally can better identify local features such as edges, but contain weaker semantic information. Existing state-of-the-art frameworks such as HRNet have kept low-resolution and high-re…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: TechReport

  49. arXiv:2210.07124  [pdf, other]

    cs.CV

    RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer

    Authors: Jian Wang, Chenhui Gou, Qiman Wu, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: Recently, transformer-based networks have shown impressive results in semantic segmentation. Yet for real-time semantic segmentation, pure CNN-based approaches still dominate in this field, due to the time-consuming computation mechanism of transformer. We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation, which achieves a better trade-off between performa…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: NeurIPS2022

  50. arXiv:2210.05097  [pdf]

    cs.CV

    Repainting and Imitating Learning for Lane Detection

    Authors: Yue He, Minyue Jiang, Xiaoqing Ye, Liang Du, Zhikang Zou, Wei Zhang, Xiao Tan, Errui Ding

    Abstract: Current lane detection methods are struggling with the invisible-lane issue caused by heavy shadows, severe road mark degradation, and serious vehicle occlusion. As a result, discriminative lane features can be barely learned by the network despite elaborate designs due to the inherent invisibility of lanes in the wild. In this paper, we aim at finding an enhanced feature space where the lan…

    Submitted 10 October, 2022; originally announced October 2022.