-
Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving
Authors:
Ran Tian,
Boyi Li,
Xinshuo Weng,
Yuxiao Chen,
Edward Schmerling,
Yue Wang,
Boris Ivanovic,
Marco Pavone
Abstract:
The autonomous driving industry is increasingly adopting end-to-end learning from sensory inputs to minimize human biases in system design. Traditional end-to-end driving models, however, suffer from long-tail events due to rare or unseen inputs within their training distributions. To address this, we propose TOKEN, a novel Multi-Modal Large Language Model (MM-LLM) that tokenizes the world into object-level knowledge, enabling better utilization of LLM's reasoning capabilities to enhance autonomous vehicle planning in long-tail scenarios. TOKEN effectively alleviates data scarcity and inefficient tokenization by leveraging a traditional end-to-end driving model to produce condensed and semantically enriched representations of the scene, which are optimized for LLM planning compatibility through deliberate representation and reasoning alignment training stages. Our results demonstrate that TOKEN excels in grounding, reasoning, and planning capabilities, outperforming existing frameworks with a 27% reduction in trajectory L2 error and a 39% decrease in collision rates in long-tail scenarios. Additionally, our work highlights the importance of representation alignment and structured reasoning in sparking the common-sense reasoning capabilities of MM-LLMs for effective planning.
Submitted 1 July, 2024;
originally announced July 2024.
-
NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
Authors:
Daniel Dauner,
Marcel Hallgarten,
Tianyu Li,
Xinshuo Weng,
Zhiyu Huang,
Zetong Yang,
Hongyang Li,
Igor Gilitschenski,
Boris Ivanovic,
Marco Pavone,
Andreas Geiger,
Kashyap Chitta
Abstract:
Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird's eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements such as TransFuser can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges. Our code is available at https://github.com/autonomousvision/navsim.
Submitted 21 June, 2024;
originally announced June 2024.
-
Meta-Backscatter: A New ISAC Paradigm for Battery-Free Internet of Things
Authors:
Xu Liu,
Hongliang Zhang,
Kaigui Bian,
Xi Weng,
Lingyang Song
Abstract:
The meta-material sensor has been regarded as a next-generation sensing technology for the battery-free Internet of Things (IoT) due to its battery-free characteristic and improved sensing performance. The meta-material sensors function as backscatter tags that change their reflection coefficients with the conditions of sensing targets such as temperature and gas concentration, allowing transceivers to perform sensing by analyzing the reflected signals from the sensors. Simultaneously, the sensors also function as environmental scatterers, creating additional signal paths to enhance communication performance. Therefore, the meta-material sensor potentially provides a new paradigm of Integrated Sensing and Communication (ISAC) for the battery-free IoT system. In this article, we first propose a Meta-Backscatter system that utilizes meta-material sensors to achieve diverse sensing functionalities and improved communication performance. We begin with an introduction to the meta-material sensor and further elaborate on the Meta-Backscatter system. Subsequently, we present optimization strategies for meta-material sensors, transmitters, and receivers to strike a balance between sensing and communication. Furthermore, this article provides a case study of the system and examines its feasibility and trade-offs through simulation results. Finally, potential extensions of the system and their related research challenges are addressed.
Submitted 11 June, 2024;
originally announced June 2024.
-
Angular Momentum-Resolved Inelastic Electron Scattering for Nuclear Giant Resonances
Authors:
Zhi-Wei Lu,
Liang Guo,
Mamutjan Ababekri,
Jia-lin Zhang,
Xiu-Feng Weng,
Yuanbin Wu,
Yi-Fei Niu,
Jian-Xing Li
Abstract:
Giant resonances (GRs) provide crucial insights into nuclear physics and astrophysics. Exciting GRs using particles like electrons is effective, yet the angular momentum (AM) transfer of electrons, including both intrinsic spin and orbital degrees of freedom in inelastic scattering, has never been studied. Here, we investigate AM transfer in GRs excited by plane-wave and vortex electrons, developing a comprehensive AM-resolved inelastic electron scattering theory. We find that even plane-wave electrons can model-independently extract transition strengths of higher multipolarity by selecting specific AM states of scattered electrons. Additionally, relativistic vortex electrons with orbital angular momentum (OAM) $\pm1$ can be efficiently generated. Vortex electrons can also be used to extract GR transition strength as in the plane-wave case, regardless of the position of the nucleus relative to the beam axis. Furthermore, relativistic vortex electrons with larger OAM can be generated for on-axis nuclei due to AM conservation. Our method offers new perspectives for nuclear structure research and paves the way for generating vortex particles.
Submitted 8 June, 2024;
originally announced June 2024.
-
Large low-field magnetocaloric response in a ferromagnetic gadolinium orthophosphate
Authors:
Ziyu W. Yang,
Jie Zhang,
Maocai Pi,
Xubin Ye,
Chenxu Kang,
Xiaoliang Weng,
Wei Tang,
Hongzhi Cui,
Yu-Jia Zeng,
Youwen Long
Abstract:
Bulk magnetic and thermodynamic measurements, along with mean-field calculations, were conducted on ferromagnetic K3Gd5(PO4)6 powders. No magnetic ordering was observed down to 2 K, while the application of an external field B > 1 T resulted in the splitting of the Gd3+ ground state multiplet and induced a non-cooperative Schottky effect. The average nearest-neighbor exchange strength |J1/kB| is determined to be 0.017 K, which leads to a remarkably large low-field magnetic entropy change ΔSm = 36.2 J kg-1 K-1 under applied field change B = 2 T at temperature T = 2 K, as well as a maximum adiabatic temperature change Tad = 10.9 K. We contend that ferromagnetic gadolinium orthophosphates serve as a promising reservoir for exploring advanced magnetic refrigerants applicable under low magnetic fields.
Submitted 31 May, 2024;
originally announced May 2024.
-
Heavy baryons in the relativized quark model with chromodynamics
Authors:
Xin-Zhen Weng,
Wei-Zhen Deng,
Shi-Lin Zhu
Abstract:
Following the work of Capstick and Isgur [\href{https://doi.org/10.1103/PhysRevD.34.2809}{Phys.~Rev.~D~34,~2809~(1986)}], we systematically study the mass spectrum of the heavy baryons in the relativized quark potential model with chromodynamics. Besides the original Godfrey-Isgur (GI) model, we also adopt a modified GI model which replaces the linear confinement by a screened one. The two models give similar results in our work. All heavy baryons observed so far can be explained as three-quark states. In particular, we identify the $Ω_{c}(3000)$/$Ω_{b}(6316)$, $Ω_{c}(3050)$/$Ω_{b}(6330)$, $Ω_{c}(3065)$/$Ω_{b}(6340)$ and $Ω_{c}(3090)$/$Ω_{b}(6350)$ states as the $p_λ$ excitations with quantum numbers $1/2^{-}$, $3/2^{-}$, $3/2^{-}$ and $5/2^{-}$. The $Ω_{c}(3120)$ is a $3/2^{-}$ state with the $p_ρ$ excitation, whose bottom partner is predicted to be $Ω_{b}(6446/6457,3/2^{-})$. The higher state $Ω_{c}(3188)$ is the $2s_λ$ excitation with quantum numbers $1/2^{+}$, and $Ω_{c}(3327)$ is a $d_λ$ excitation with quantum numbers $3/2^{+}$ or $5/2^{+}$. In addition, the $Λ_{c}(2940)$ with quantum numbers $J^{P}=3/2^{-}$ could be explained as the $p_ρ$ excitation.
Submitted 29 May, 2024;
originally announced May 2024.
-
TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models
Authors:
Junlong Jia,
Ying Hu,
Xi Weng,
Yiming Shi,
Miao Li,
Xingjian Zhang,
Baichuan Zhou,
Ziyu Liu,
Jie Luo,
Lei Huang,
Ji Wu
Abstract:
We present TinyLLaVA Factory, an open-source modular codebase for small-scale large multimodal models (LMMs) with a focus on simplicity of code implementations, extensibility of new features, and reproducibility of training results. Following the design philosophy of the factory pattern in software engineering, TinyLLaVA Factory modularizes the entire system into interchangeable components, with each component integrating a suite of cutting-edge models and methods, meanwhile leaving room for extensions to more features. In addition to allowing users to customize their own LMMs, TinyLLaVA Factory provides popular training recipes to let users pretrain and finetune their models with less coding effort. Empirical experiments validate the effectiveness of our codebase. The goal of TinyLLaVA Factory is to assist researchers and practitioners in exploring the wide landscape of designing and training small-scale LMMs with affordable computational resources.
Submitted 20 May, 2024;
originally announced May 2024.
-
Language-Image Models with 3D Understanding
Authors:
Jang Hyun Cho,
Boris Ivanovic,
Yulong Cao,
Edward Schmerling,
Yue Wang,
Xinshuo Weng,
Boyi Li,
Yurong You,
Philipp Krähenbühl,
Yan Wang,
Marco Pavone
Abstract:
Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs' perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: multi-turn question answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling yields a strong 3D perception capability without 3D-specific architectural design or training objectives. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted, for example with a 2D box or a set of candidate 3D boxes from specialists. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios. Cube-LLM also shows competitive results in general MLLM benchmarks such as refCOCO for 2D grounding, with an average score of 87.0, as well as visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at https://janghyuncho.github.io/Cube-LLM.
Submitted 6 May, 2024;
originally announced May 2024.
-
H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model
Authors:
Chao Pang,
Jiang Wu,
Jiayu Li,
Yi Liu,
Jiaxing Sun,
Weijia Li,
Xingxing Weng,
Shuai Wang,
Litong Feng,
Gui-Song Xia,
Conghui He
Abstract:
Generic large Vision-Language Models (VLMs) are developing rapidly but still perform poorly in the Remote Sensing (RS) domain, owing to the unique and specialized nature of RS imagery and the comparatively limited spatial perception of current VLMs. Existing Remote Sensing specific Vision Language Models (RSVLMs) still have considerable potential for improvement, primarily owing to the lack of large-scale, high-quality RS vision-language datasets. We constructed HqDC-1.4M, a large-scale dataset of High-quality and Detailed Captions for RS images, containing 1.4 million image-caption pairs, which not only enhances the RSVLM's understanding of RS images but also significantly improves the model's spatial perception abilities, such as localization and counting, thereby increasing the helpfulness of the RSVLM. Moreover, to address the inevitable "hallucination" problem in RSVLMs, we developed RSSA, the first dataset aimed at enhancing the Self-Awareness capability of RSVLMs. By incorporating a variety of unanswerable questions into typical RS visual question-answering tasks, RSSA effectively improves the truthfulness and reduces the hallucinations of the model's outputs, thereby enhancing the honesty of the RSVLM. Based on these datasets, we proposed H2RSVLM, the Helpful and Honest Remote Sensing Vision Language Model. H2RSVLM has achieved outstanding performance on multiple RS public datasets and is capable of recognizing and refusing to answer unanswerable questions, effectively mitigating incorrect generations. We will release the code, data and model weights at https://github.com/opendatalab/H2RSVLM.
Submitted 29 March, 2024;
originally announced March 2024.
-
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Authors:
Baichuan Zhou,
Ying Hu,
Xi Weng,
Junlong Jia,
Jie Luo,
Xien Liu,
Ji Wu,
Lei Huang
Abstract:
We present the TinyLLaVA framework, which provides a unified perspective for designing and analyzing small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, language models, training data and training recipes. Our extensive experiments showed that, with better-quality data combined with better training recipes, smaller LMMs can consistently achieve on-par performance compared to bigger LMMs. Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL. We hope our findings can serve as baselines for future research in terms of data scaling, training setups and model selections. Our model weights and code will be made public.
Submitted 22 February, 2024;
originally announced February 2024.
-
HiCD: Change Detection in Quality-Varied Images via Hierarchical Correlation Distillation
Authors:
Chao Pang,
Xingxing Weng,
Jiang Wu,
Qiang Wang,
Gui-Song Xia
Abstract:
Advanced change detection techniques primarily target image pairs of equal and high quality. However, variations in imaging conditions and platforms frequently lead to image pairs with distinct qualities: one image being high-quality, while the other being low-quality. These disparities in image quality present significant challenges for understanding image pairs semantically and extracting change features, ultimately resulting in a notable decline in performance. To tackle this challenge, we introduce an innovative training strategy grounded in knowledge distillation. The core idea revolves around leveraging task knowledge acquired from high-quality image pairs to guide the model's learning process when dealing with image pairs that exhibit differences in quality. Additionally, we develop a hierarchical correlation distillation approach (involving self-correlation, cross-correlation, and global correlation). This approach compels the student model to replicate the correlations inherent in the teacher model, rather than focusing solely on individual features. This ensures effective knowledge transfer while maintaining the student model's training flexibility.
Submitted 19 January, 2024;
originally announced January 2024.
-
Towards End-to-End GPS Localization with Neural Pseudorange Correction
Authors:
Xu Weng,
KV Ling,
Haochen Liu,
Kun Cao
Abstract:
Pseudorange errors are the root cause of localization inaccuracy in GPS. Previous data-driven methods regress and eliminate pseudorange errors using handcrafted intermediate labels. Unlike them, we propose an end-to-end GPS localization framework, E2E-PrNet, to train a neural network for pseudorange correction (PrNet) directly using the final task loss calculated with the ground truth of GPS receiver states. The gradients of the loss with respect to learnable parameters are backpropagated through a differentiable nonlinear least squares optimizer to PrNet. The feasibility is verified with GPS data collected by Android phones, showing that E2E-PrNet outperforms the state-of-the-art end-to-end GPS localization methods.
Submitted 19 January, 2024;
originally announced January 2024.
-
Properties of $Q^{5}q$ dibaryons
Authors:
Xin-Zhen Weng
Abstract:
We investigate heavy flavor dibaryons with five heavy quarks $Q$ ($Q=\{c,b\}$) and one light quark $q$ ($q=\{u,d,s\}$), namely the $Q^{5}q$ dibaryons. In the framework of an extended chromomagnetic model, we systematically study the mass spectrum of these dibaryons. We find no stable state below the corresponding baryon-baryon thresholds. In addition to the analysis of the masses, we also study their two body decay properties by estimating the relative width ratios of the decay channels. We hope our study will be of help for future experiments.
Submitted 17 January, 2024;
originally announced January 2024.
-
Visual Tomography: Physically Faithful Volumetric Models of Partially Translucent Objects
Authors:
David Nakath,
Xiangyu Weng,
Mengkun She,
Kevin Köser
Abstract:
When created faithfully from real-world data, digital 3D representations of objects can be useful for human or computer-assisted analysis. Such models can also serve to generate training data for machine learning approaches in settings where data is difficult to obtain or where too little training data exists, e.g. by providing novel views or images in varying conditions. While the vast majority of visual 3D reconstruction approaches focus on non-physical models, textured object surfaces or shapes, in this contribution we propose a volumetric reconstruction approach that obtains a physical model including the interior of partially translucent objects such as plankton or insects. Our technique photographs the object under different poses in front of a bright white light source and computes absorption and scattering per voxel. It can be interpreted as visual tomography that we solve by inverse raytracing. We additionally suggest a method to convert non-physical NeRF media into a physically-based volumetric grid for initialization and illustrate the usefulness of the approach using two real-world plankton validation sets, with the lab-scanned models finally also being relit and virtually submerged in a scenario with augmented medium and illumination conditions. Please visit the project homepage at www.marine.informatik.uni-kiel.de/go/vito
Submitted 20 December, 2023;
originally announced December 2023.
-
Augmenting Lane Perception and Topology Understanding with Standard Definition Navigation Maps
Authors:
Katie Z Luo,
Xinshuo Weng,
Yan Wang,
Shuang Wu,
Jie Li,
Kilian Q Weinberger,
Yue Wang,
Marco Pavone
Abstract:
Autonomous driving has traditionally relied heavily on costly and labor-intensive High Definition (HD) maps, hindering scalability. In contrast, Standard Definition (SD) maps are more affordable and have worldwide coverage, offering a scalable alternative. In this work, we systematically explore the effect of SD maps for real-time lane-topology understanding. We propose a novel framework to integrate SD maps into online map prediction and propose a Transformer-based encoder, SD Map Encoder Representations from transFormers, to leverage priors in SD maps for the lane-topology prediction task. This enhancement consistently and significantly boosts (by up to 60%) lane detection and topology prediction on current state-of-the-art online map prediction methods without bells and whistles and can be immediately incorporated into any Transformer-based lane-topology method. Code is available at https://github.com/NVlabs/SMERF.
Submitted 7 November, 2023;
originally announced November 2023.
-
EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision
Authors:
Jiawei Yang,
Boris Ivanovic,
Or Litany,
Xinshuo Weng,
Seung Wook Kim,
Boyi Li,
Tong Che,
Danfei Xu,
Sanja Fidler,
Marco Pavone,
Yue Wang
Abstract:
We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly-dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly-dynamic settings.
Submitted 3 November, 2023;
originally announced November 2023.
-
Generation of $γ$ photons with extremely large orbital angular momenta
Authors:
Ren-Tong Guo,
Mamutjan Ababekri,
Qian Zhao,
Yousef I. Salamin,
Liang-Liang Ji,
Zhi-Gang Bu,
Zhong-Feng Xu,
Xiu-Feng Weng,
Jian-Xing Li
Abstract:
Vortex $γ$ photons, which carry large intrinsic orbital angular momenta (OAM), have significant applications in nuclear, atomic, hadron, particle and astro-physics, but their production remains unclear. In this work, we investigate the generation of such photons from nonlinear Compton scattering of circularly polarized monochromatic lasers on vortex electrons. We develop a quantum radiation theory for ultrarelativistic vortex electrons in lasers by using the harmonics expansion and spin eigenfunctions, which allows us to explore the kinematical characteristics, angular momentum transfer mechanisms, and formation conditions of vortex $γ$ photons. The multiphoton absorption of electrons enables the vortex $γ$ photons, with fixed polarizations and energies, to exist in mixed states comprised of multiple harmonics. Each harmonic represents a vortex eigenmode and has transverse momentum broadening due to transverse momenta of the vortex electrons. The large topological charges associated with vortex electrons offer the possibility for $γ$ photons to carry adjustable OAM quantum numbers from tens to thousands of units, even at moderate laser intensities. $γ$ photons with large OAM and transverse coherence length can assist in influencing quantum selection rules and extracting phase of the scattering amplitude in scattering processes.
Submitted 24 October, 2023;
originally announced October 2023.
-
PrNet: A Neural Network for Correcting Pseudoranges to Improve Positioning with Android Raw GNSS Measurements
Authors:
Xu Weng,
Keck Voon Ling,
Haochen Liu
Abstract:
We present a neural network for mitigating biased errors in pseudoranges to improve localization performance with data collected from mobile phones. A satellite-wise Multilayer Perceptron (MLP) is designed to regress the pseudorange bias correction from six satellite-, receiver-, and context-related features derived from Android raw Global Navigation Satellite System (GNSS) measurements. To train the MLP, we carefully calculate the target values of pseudorange bias using location ground truth and smoothing techniques and optimize a loss function involving the estimation residuals of smartphone clock bias. The corrected pseudoranges are then used by a model-based localization engine to compute locations. The Google Smartphone Decimeter Challenge (GSDC) dataset, which contains Android smartphone data collected from both rural and urban areas, is utilized for evaluation. Both fingerprinting and cross-trace localization results demonstrate that our proposed method outperforms model-based and state-of-the-art data-driven approaches.
Submitted 22 December, 2023; v1 submitted 16 September, 2023;
originally announced September 2023.
-
Localization with Noisy Android Raw GNSS Measurements
Authors:
Xu Weng,
Keck Voon Ling
Abstract:
Android raw Global Navigation Satellite System (GNSS) measurements are expected to give smartphones the power to take on demanding localization tasks that are traditionally performed by specialized GNSS receivers. The hardware constraints, however, make Android raw GNSS measurements much noisier than geodetic-quality ones. This study elucidates the principles of localization using Android raw GNSS measurements and leverages Moving Horizon Estimation (MHE), the Extended Kalman Filter (EKF), and the Rauch-Tung-Striebel (RTS) smoother for noise suppression. Experimental results show that the RTS smoother achieves the best positioning performance, reducing horizontal positioning errors by 76.4% and 46.5% in static and dynamic scenarios, respectively, compared with the baseline weighted least squares (WLS) method. Our code is available at https://github.com/ailocar/androidGnss.
Submitted 28 September, 2023; v1 submitted 16 September, 2023;
originally announced September 2023.
-
Language Conditioned Traffic Generation
Authors:
Shuhan Tan,
Boris Ivanovic,
Xinshuo Weng,
Marco Pavone,
Philipp Kraehenbuehl
Abstract:
Simulation forms the backbone of modern self-driving development. Simulators help develop, test, and improve driving systems without putting humans, vehicles, or their environment at risk. However, simulators face a major challenge: They rely on realistic, scalable, yet interesting content. While recent advances in rendering and scene reconstruction make great strides in creating static scene assets, modeling their layout, dynamics, and behaviors remains challenging. In this work, we turn to language as a source of supervision for dynamic traffic scene generation. Our model, LCTGen, combines a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps, and produces an initial traffic distribution, as well as the dynamics of each vehicle. LCTGen outperforms prior work in both unconditional and conditional traffic scene generation in terms of realism and fidelity. Code and video will be available at https://ariostgx.github.io/lctgen.
Submitted 16 July, 2023;
originally announced July 2023.
-
DAM-Net: Global Flood Detection from SAR Imagery Using Differential Attention Metric-Based Vision Transformers
Authors:
Tamer Saleh,
Xingxing Weng,
Shimaa Holail,
Chen Hao,
Gui-Song Xia
Abstract:
The detection of flooded areas using high-resolution synthetic aperture radar (SAR) imagery is a critical task with applications in crisis and disaster management, as well as environmental resource planning. However, the complex nature of SAR images presents a challenge that often leads to an overestimation of the flood extent. To address this issue, we propose a novel differential attention metric-based network (DAM-Net) in this study. The DAM-Net comprises two key components: a weight-sharing Siamese backbone to obtain multi-scale change features of multi-temporal images and tokens containing high-level semantic information of water-body changes, and a temporal differential fusion (TDF) module that integrates semantic tokens and change features to generate flood maps with reduced speckle noise. Specifically, the backbone is split into multiple stages. In each stage, we design three modules, namely, temporal-wise feature extraction (TWFE), cross-temporal change attention (CTCA), and temporal-aware change enhancement (TACE), to effectively extract the change features. In TACE of the last stage, we introduce a class token to record high-level semantic information of water-body changes via the attention mechanism. Another challenge faced by data-driven deep learning algorithms is the limited availability of flood detection datasets. To overcome this, we have created the S1GFloods open-source dataset, a global-scale high-resolution Sentinel-1 SAR image pairs dataset covering 46 global flood events between 2015 and 2022. The experiments on the S1GFloods dataset using the proposed DAM-Net showed top results compared to state-of-the-art methods in terms of overall accuracy, F1-score, and IoU, which reached 97.8%, 96.5%, and 93.2%, respectively. Our dataset and code will be available online at https://github.com/Tamer-Saleh/S1GFlood-Detection.
Submitted 1 June, 2023;
originally announced June 2023.
-
Modulate Your Spectrum in Self-Supervised Learning
Authors:
Xi Weng,
Yunhao Ni,
Tengwei Song,
Jie Luo,
Rao Muhammad Anwer,
Salman Khan,
Fahad Shahbaz Khan,
Lei Huang
Abstract:
Whitening loss offers a theoretical guarantee against feature collapse in self-supervised learning (SSL) with joint embedding architectures. Typically, it involves a hard whitening approach, transforming the embedding and applying loss to the whitened output. In this work, we introduce Spectral Transformation (ST), a framework to modulate the spectrum of embedding and to seek functions beyond whitening that can avoid dimensional collapse. We show that whitening is a special instance of ST by definition, and our empirical investigations unveil other ST instances capable of preventing collapse. Additionally, we propose a novel ST instance named IterNorm with trace loss (INTL). Theoretical analysis confirms INTL's efficacy in preventing collapse and modulating the spectrum of embedding toward equal eigenvalues during optimization. Our experiments on ImageNet classification and COCO object detection demonstrate INTL's potential in learning superior representations. The code is available at https://github.com/winci-ai/INTL.
Submitted 21 January, 2024; v1 submitted 26 May, 2023;
originally announced May 2023.
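The "hard whitening" step mentioned in the abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes ZCA whitening as the whitening transform, a mean-squared distance as the loss, and synthetic embeddings; the names `whiten`, `whitening_loss`, `view_a`, and `view_b` are hypothetical.

```python
import numpy as np

def whiten(embeddings, eps=1e-6):
    """ZCA-whiten a (batch, dim) array so its covariance is close to identity."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    # Symmetric eigendecomposition gives cov = V diag(w) V^T.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Inverse square root of the covariance; eps guards against tiny eigenvalues.
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ inv_sqrt

def whitening_loss(view_a, view_b):
    """Mean squared distance between the whitened embeddings of two views."""
    return float(np.mean((whiten(view_a) - whiten(view_b)) ** 2))

# Synthetic, strongly correlated embeddings (an assumption for illustration).
rng = np.random.default_rng(0)
mixing = np.triu(np.ones((8, 8)))
view_a = rng.normal(size=(512, 8)) @ mixing
view_b = view_a + 0.1 * rng.normal(size=(512, 8))

whitened = whiten(view_a)
cov_whitened = whitened.T @ whitened / whitened.shape[0]
loss = whitening_loss(view_a, view_b)
print(np.round(cov_whitened, 3))
print(loss)
```

After whitening, all eigenvalues of the embedding covariance are approximately equal, which is the sense in which whitening rules out dimensional collapse.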
-
PolarDB-IMCI: A Cloud-Native HTAP Database System at Alibaba
Authors:
Jianying Wang,
Tongliang Li,
Haoze Song,
Xinjun Yang,
Wenchao Zhou,
Feifei Li,
Baoyue Yan,
Qianqian Wu,
Yukun Liang,
Chengjun Ying,
Yujie Wang,
Baokai Chen,
Chang Cai,
Yubin Ruan,
Xiaoyi Weng,
Shibin Chen,
Liang Yin,
Chengzhong Yang,
Xin Cai,
Hongyan Xing,
Nanlong Yu,
Xiaofei Chen,
Dapeng Huang,
Jianling Sun
Abstract:
Cloud-native databases have become the de-facto choice for mission-critical applications on the cloud due to the need for high availability, resource elasticity, and cost efficiency. Meanwhile, driven by the increasing connectivity between data generation and analysis, users prefer a single database to efficiently process both OLTP and OLAP workloads, which enhances data freshness and reduces the complexity of data synchronization and the overall business cost.
In this paper, we summarize five crucial design goals for a cloud-native HTAP database based on our experience and customers' feedback, i.e., transparency, competitive OLAP performance, minimal perturbation on OLTP workloads, high data freshness, and excellent resource elasticity. As our solution to realize these goals, we present PolarDB-IMCI, a cloud-native HTAP database system designed and deployed at Alibaba Cloud. Our evaluation results show that PolarDB-IMCI is able to handle HTAP efficiently on both experimental and production workloads; notably, it speeds up analytical queries by up to $149\times$ on TPC-H (100 GB). PolarDB-IMCI introduces low visibility delay and little performance perturbation on OLTP workloads (< 5%), and resource elasticity can be achieved by scaling out in tens of seconds.
Submitted 15 May, 2023;
originally announced May 2023.
-
Task-Aware Risk Estimation of Perception Failures for Autonomous Vehicles
Authors:
Pasquale Antonante,
Sushant Veer,
Karen Leung,
Xinshuo Weng,
Luca Carlone,
Marco Pavone
Abstract:
Safety and performance are key enablers for autonomous driving: on the one hand we want our autonomous vehicles (AVs) to be safe, while at the same time their performance (e.g., comfort or progression) is key to adoption. To effectively walk the tight-rope between safety and performance, AVs need to be risk-averse, but not entirely risk-avoidant. To facilitate safe-yet-performant driving, in this paper, we develop a task-aware risk estimator that assesses the risk a perception failure poses to the AV's motion plan. If the failure has no bearing on the safety of the AV's motion plan, then regardless of how egregious the perception failure is, our task-aware risk estimator considers the failure to have a low risk; on the other hand, if a seemingly benign perception failure severely impacts the motion plan, then our estimator considers it to have a high risk. In this paper, we propose a task-aware risk estimator to decide whether a safety maneuver needs to be triggered. To estimate the task-aware risk, first, we leverage the perception failure - detected by a perception monitor - to synthesize an alternative plausible model for the vehicle's surroundings. The risk due to the perception failure is then formalized as the "relative" risk to the AV's motion plan between the perceived and the alternative plausible scenario. We employ a statistical tool called copula, which models tail dependencies between distributions, to estimate this risk. The theoretical properties of the copula allow us to compute probably approximately correct (PAC) estimates of the risk. We evaluate our task-aware risk estimator using NuPlan and compare it with established baselines, showing that the proposed risk estimator achieves the best F1-score (doubling the score of the best baseline) and exhibits a good balance between recall and precision, i.e., a good balance of safety and performance.
Submitted 2 May, 2023;
originally announced May 2023.
-
Parameterized Learning and Distillation with Vortex-encoded Spectral Correlations
Authors:
Altai Perry,
Xiaojing Weng,
Erfan Nozari,
Luat Vuong
Abstract:
Spectral computational methods leverage modal or nonlocal representations of data, and a physically realized approach to spectral computation pertains to encoded diffraction. Encoded diffraction offers a hybrid approach that pairs analog wave propagation with digital back-end electronics; however, the intermediate sensor patterns are correlations rather than linear signal weights, which limits the development of robust and efficient downstream analyses. Here, with vortex encoders, we show that the solution for the signal field from sensor intensity adopts the form of polynomial regression, which is subsequently solved with a learned, linear transformation. This result establishes an analytic rationale for a spectral-methods paradigm in physically realized machine learning systems. To demonstrate this paradigm, we quantify the learning that is transferred with an image basis using speckle parameters, Singular-Value Decomposition Entropy ($H_{SVD}$) and Speckle-Analogue Density (SAD). We show that $H_{SVD}$, a proxy for image complexity, indicates the rate at which a model converges. Similarly, SAD, an averaged spatial frequency, marks a threshold for structurally similar reconstruction. With a vortex encoder, this approach with parameterized training may be extended to distill features. In fact, with images reconstructed with our models, we achieve classification accuracies that rival decade-old, state-of-the-art computer algorithms. This means that the process of learning compressed spectral correlations distills features to aid image classification, even when the goal images are feature-agnostic speckles. Our work highlights opportunities for analytic and axiom-driven machine-learning designs appropriate for real-time applications.
Submitted 6 October, 2023; v1 submitted 24 April, 2023;
originally announced April 2023.
-
A High-Frequency Focused Network for Lightweight Single Image Super-Resolution
Authors:
Xiaotian Weng,
Yi Chen,
Zhichao Zheng,
Yanhui Gu,
Junsheng Zhou,
Yudong Zhang
Abstract:
Lightweight neural networks for single-image super-resolution (SISR) tasks have made substantial breakthroughs in recent years. Compared to low-frequency information, high-frequency detail is much more difficult to reconstruct. Most SISR models allocate equal computational resources for low-frequency and high-frequency information, which leads to redundant processing of simple low-frequency information and inadequate recovery of more challenging high-frequency information. We propose a novel High-Frequency Focused Network (HFFN) built from High-Frequency Focused Blocks (HFFBs) that selectively enhance high-frequency information while minimizing redundant feature computation of low-frequency information. The HFFB effectively allocates more computational resources to the more challenging reconstruction of high-frequency information. Moreover, we propose a Local Feature Fusion Block (LFFB) that effectively fuses features from multiple HFFBs in a local region, utilizing complementary information across layers to enhance feature representativeness and reduce artifacts in reconstructed images. We assess the efficacy of our proposed HFFN on five benchmark datasets and show that it significantly enhances the super-resolution performance of the network. Our experimental results demonstrate state-of-the-art performance in reconstructing high-frequency information while using a low number of parameters.
Submitted 21 March, 2023;
originally announced March 2023.
-
Tree-structured Policy Planning with Learned Behavior Models
Authors:
Yuxiao Chen,
Peter Karkus,
Boris Ivanovic,
Xinshuo Weng,
Marco Pavone
Abstract:
Autonomous vehicles (AVs) need to reason about the multimodal behavior of neighboring agents while planning their own motion. Many existing trajectory planners seek a single trajectory that performs well under all plausible futures simultaneously, ignoring bi-directional interactions and thus leading to overly conservative plans. Policy planning, whereby the ego agent plans a policy that reacts to the environment's multimodal behavior, is a promising direction as it can account for the action-reaction interactions between the AV and the environment. However, most existing policy planners do not scale to the complexity of real autonomous vehicle applications: they are either not compatible with modern deep learning prediction models, not interpretable, or not able to generate high-quality trajectories. To fill this gap, we propose Tree Policy Planning (TPP), a policy planner that is compatible with state-of-the-art deep learning prediction models, generates multistage motion plans, and accounts for the influence of the ego agent on the environment's behavior. The key idea of TPP is to reduce the continuous optimization problem into a tractable discrete Markov Decision Process (MDP) through the construction of two tree structures: an ego trajectory tree for ego trajectory options, and a scenario tree for multi-modal ego-conditioned environment predictions. We demonstrate the efficacy of TPP in closed-loop simulations based on the real-world nuScenes dataset, and the results show that TPP scales to realistic AV scenarios and significantly outperforms non-policy baselines.
Submitted 26 February, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
Singular Value Decomposition and Entropy Dimension of Fractals
Authors:
Xiaojing Weng,
Altai Perry,
Michael Maroun,
Luat T. Vuong
Abstract:
We analyze the singular value decomposition (SVD) and SVD entropy of Cantor fractals produced by the Kronecker product. Our primary results show that SVD entropy is a measure of image "complexity dimension" that is invariant under the number of Kronecker-product self-iterations (i.e., fractal order). SVD entropy is therefore similar to the fractal Hausdorff complexity dimension but suitable for characterizing fractal wave phenomena. Our field-based normalization (Renyi entropy index = 1) illustrates the uncommon step-shaped and cluster-patterned distributions of the fractal singular values and their SVD entropy. As a modal measure of complexity, SVD entropy has uses for a variety of wireless communication, free-space optical, and remote sensing applications.
Submitted 15 November, 2022;
originally announced November 2022.
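The invariance claimed in the abstract above can be checked numerically with a short sketch. It rests on assumptions not taken from the paper: the singular values are normalized to sum to one and fed into a Shannon entropy (Renyi index 1), the "complexity dimension" is taken to be that entropy divided by the logarithm of the matrix side length, and the 3x3 carpet-like `base` pattern is an illustrative choice. The helper names `svd_entropy` and `kronecker_power` are hypothetical.

```python
import numpy as np

def svd_entropy(matrix, rel_tol=1e-10):
    """Shannon entropy of the singular values, normalized to sum to one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    s = s[s > rel_tol * s.max()]  # discard numerically zero singular values
    p = s / s.sum()
    return float(-np.sum(p * np.log(p)))

def kronecker_power(base, order):
    """Kronecker product of `base` with itself, `order` factors in total."""
    result = base
    for _ in range(order - 1):
        result = np.kron(result, base)
    return result

# Illustrative base pattern (an assumption, not the paper's fractal).
base = np.array([[1.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [1.0, 1.0, 1.0]])

ratios = []
for order in range(1, 5):
    fractal = kronecker_power(base, order)
    # Entropy divided by log(side length): the assumed "complexity dimension".
    ratios.append(svd_entropy(fractal) / np.log(fractal.shape[0]))

print(ratios)
```

The ratio stays constant because the singular values of a Kronecker product are the pairwise products of the factors' singular values, so the entropy grows linearly with the fractal order, exactly as the logarithm of the side length does.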
-
Depositing boron on Cu(111): Borophene or boride?
Authors:
Xiao-Ji Weng,
Jie Bai,
Jingyu Hou,
Yi Zhu,
Li Wang,
Penghui Li,
Anmin Nie,
Bo Xu,
Xiang-Feng Zhou,
Yongjun Tian
Abstract:
Large-area single-crystal surface structures were successfully prepared on Cu(111) substrate with boron deposition, which is critical for prospective applications. However, the proposed borophene structures do not match the scanning tunneling microscopy (STM) results very well, while the proposed copper boride is at odds with the traditional knowledge that ordered copper-rich borides normally do not exist due to small difference in electronegativity and large difference in atomic size. To clarify the controversy and elucidate the formation mechanism of the unexpected copper boride, we conducted systematic STM, X-ray photoelectron spectroscopy and angle-resolved photoemission spectroscopy investigations, confirming the synthesis of two-dimensional copper boride rather than borophene on Cu(111) after boron deposition under ultrahigh vacuum. First-principles calculations with defective surface models further indicate that boron atoms tend to react with Cu atoms near terrace edges or defects, which in turn shapes the intermediate structures of copper boride and leads to the formation of stable Cu-B monolayer via large-scale surface reconstruction eventually.
Submitted 19 November, 2022;
originally announced November 2022.
-
Optimal Pricing Schemes in the Presence of Social Learning and Costly Reporting
Authors:
Kaiwei Zhang,
Xi Weng,
Xienan Cheng
Abstract:
A monopoly platform sells either a risky product (with unknown utility) or a safe product (with known utility) to agents who sequentially arrive and learn the utility of the risky product by the reporting of previous agents. It is costly for agents to report utility; hence the platform has to design both the prices and the reporting bonus to motivate the agents to explore and generate new information. By allowing sellers to set bonuses, we are essentially enabling them to dynamically control the supply of learning signals without significantly affecting the demand for the product. We characterize the optimal bonus and pricing schemes offered by the profit-maximizing platform. It turns out that the optimal scheme falls into one of four types: Full Coverage, Partial Coverage, Immediate Revelation, and Non-Bonus. In a model of exponential bandit, we find that there is a dynamical switch of the types along the learning trajectory. Although learning stops efficiently, information is revealed too slowly compared with the planner's optimal solution.
Submitted 9 December, 2023; v1 submitted 14 November, 2022;
originally announced November 2022.
-
An Investigation into Whitening Loss for Self-supervised Learning
Authors:
Xi Weng,
Lei Huang,
Lei Zhao,
Rao Muhammad Anwer,
Salman Khan,
Fahad Shahbaz Khan
Abstract:
A desirable objective in self-supervised learning (SSL) is to avoid feature collapse. Whitening loss guarantees collapse avoidance by minimizing the distance between embeddings of positive pairs under the condition that the embeddings from different views are whitened. In this paper, we propose a framework with an informative indicator to analyze whitening loss, which provides a clue to demystify several interesting phenomena as well as a pivoting point connecting to other SSL methods. We reveal that batch whitening (BW) based methods do not impose whitening constraints on the embedding; they only require the embedding to be full-rank. This full-rank constraint is also sufficient to avoid dimensional collapse. Based on our analysis, we propose channel whitening with random group partition (CW-RGP), which exploits the advantages of BW-based methods in preventing collapse and avoids their disadvantage of requiring a large batch size. Experimental results on ImageNet classification and COCO object detection reveal that the proposed CW-RGP has promising potential for learning good representations. The code is available at https://github.com/winci-ai/CW-RGP.
Submitted 7 October, 2022;
originally announced October 2022.
-
Helium-bearing superconductor at high pressure
Authors:
Jingyu Hou,
Xiao Dong,
Artem R. Oganov,
Xiao-Ji Weng,
Chun-Mei Hao,
Guochun Yang,
Hui-Tian Wang,
Xiang-Feng Zhou,
Yongjun Tian
Abstract:
Helium (He) is the most inert noble gas at ambient conditions. It adopts a hexagonal close packed structure (P63/mmc) and remains in the insulating phase up to 32 TPa. In contrast, lithium (Li) is one of the most reactive metals at zero pressure, while its cubic high-pressure phase (Fd-3m) is a weak metallic electride above 475 GPa. Strikingly, a stable compound of Li5He2 (R-3m) was formed by mixing Fd-3m Li with P63/mmc He above 700 GPa. The presence of helium promotes the lattice transformation from Fd-3m Li to Pm-3m Li, and tunes the three-dimensional distributed interstitial electrons into a mixture of zero- and two-dimensional anionic electrons. This significantly increases the degree of metallization at the Fermi level; consequently, the coupling of conductive anionic electrons with the Li-dominated vibrations is the key factor in the formation of the superconducting electride Li5He2, which has a transition temperature up to 26 K and is dynamically stable at pressures down to 210 GPa.
Submitted 30 September, 2022;
originally announced September 2022.
-
Physical essence of propagable fractional-strength optical vortices in free space
Authors:
Xiaoyu Weng,
Yu Miao,
Yang Li,
Xiangmei Dong,
Xiumin Gao,
Songlin Zhuang
Abstract:
Fractional-order vector vortex beams (VVBs) were recently demonstrated to be new carriers of fractional-strength optical vortices. However, why can those new vortex beams formed by the combination of both unstable states propagate stably in free space? Here, we solve this scientific problem by revealing the physical essence of propagable fractional-strength optical vortices in free space. Three new understandings regarding those peculiar vortex beams are therefore proposed, namely the Abbe diffraction limit, the phase evolution of the vortex beam, and the phase binary time vector property. For the first one, owing to the Abbe diffraction limit, the inherent polarization modes are intertwined together, thereby maintaining the entire peculiar vortex beams in free space. For the second one, we demonstrate the phase evolution of the vortex beam, which is the physical reason for the polarization rotation of fractional-order VVBs. For the third one, the phase is not merely a scalar attribute of a light beam, but manifests a binary time vector property. This work provides entirely different physical viewpoints on the phase of the vortex beam and the Abbe diffraction limit, which may deepen our knowledge of the behavior of light beams in classical optics.
Submitted 3 August, 2022;
originally announced August 2022.
-
Robust Trajectory Prediction against Adversarial Attacks
Authors:
Yulong Cao,
Danfei Xu,
Xinshuo Weng,
Zhuoqing Mao,
Anima Anandkumar,
Chaowei Xiao,
Marco Pavone
Abstract:
Trajectory prediction using deep neural networks (DNNs) is an essential component of autonomous driving (AD) systems. However, these methods are vulnerable to adversarial attacks, leading to serious consequences such as collisions. In this work, we identify two key ingredients to defend trajectory prediction models against adversarial attacks including (1) designing effective adversarial training methods and (2) adding domain-specific data augmentation to mitigate the performance degradation on clean data. We demonstrate that our method is able to improve the performance by 46% on adversarial data and at the cost of only 3% performance degradation on clean data, compared to the model trained with clean data. Additionally, compared to existing robust methods, our method can improve performance by 21% on adversarial examples and 9% on clean data. Our robust model is evaluated with a planner to study its downstream impacts. We demonstrate that our model can significantly reduce the severe accident rates (e.g., collisions and off-road driving).
Submitted 29 July, 2022;
originally announced August 2022.
-
Multiface: A Dataset for Neural Face Rendering
Authors:
Cheng-hsin Wuu,
Ningyuan Zheng,
Scott Ardisson,
Rohan Bali,
Danielle Belko,
Eric Brockmeyer,
Lucas Evans,
Timothy Godisart,
Hyowon Ha,
Xuhua Huang,
Alexander Hypes,
Taylor Koska,
Steven Krenn,
Stephen Lombardi,
Xiaomin Luo,
Kevyn McPhail,
Laura Millerschoen,
Michal Perdoch,
Mark Pitts,
Alexander Richard,
Jason Saragih,
Junko Saragih,
Takaaki Shiratori,
Tomas Simon,
Matt Stewart
, et al. (6 additional authors not shown)
Abstract:
Photorealistic avatars of human faces have come a long way in recent years, yet research in this area is limited by a lack of publicly available, high-quality datasets covering both dense multi-view camera captures and rich facial expressions of the captured subjects. In this work, we present Multiface, a new multi-view, high-resolution human face dataset collected from 13 identities at Reality Labs Research for neural face rendering. We introduce Mugsy, a large-scale multi-camera apparatus to capture high-resolution synchronized videos of a facial performance. The goal of Multiface is to close the gap in accessibility to high-quality data in the academic community and to enable research in VR telepresence. Along with the release of the dataset, we conduct ablation studies on the influence of different model architectures on the model's interpolation capacity for novel viewpoints and expressions. With a conditional VAE model serving as our baseline, we found that adding spatial bias, texture warp field, and residual connections improves performance on novel view synthesis. Our code and data are available at: https://github.com/facebookresearch/multiface
Submitted 26 June, 2023; v1 submitted 22 July, 2022;
originally announced July 2022.
-
Systematics of fully heavy dibaryons
Authors:
Xin-Zhen Weng,
Shi-Lin Zhu
Abstract:
We systematically study the mass spectra of the fully heavy dibaryons in an extended chromomagnetic model, which includes both the colorelectric and chromomagnetic interactions. We find no stable state below the corresponding baryon-baryon thresholds. Besides the masses, we also estimate the relative width ratios of the two-body decay channels. We hope our study will be of help for future experiments.
Submitted 6 February, 2024; v1 submitted 12 July, 2022;
originally announced July 2022.
-
Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking
Authors:
Jinkun Cao,
Jiangmiao Pang,
Xinshuo Weng,
Rawal Khirodkar,
Kris Kitani
Abstract:
Kalman filter (KF) based methods for multi-object tracking (MOT) make an assumption that objects move linearly. While this assumption is acceptable for very short periods of occlusion, linear estimates of motion for prolonged time can be highly inaccurate. Moreover, when there is no measurement available to update Kalman filter parameters, the standard convention is to trust the a priori state estimates for the a posteriori update. This leads to the accumulation of errors during a period of occlusion. The error causes significant motion direction variance in practice. In this work, we show that a basic Kalman filter can still obtain state-of-the-art tracking performance if proper care is taken to fix the noise accumulated during occlusion. Instead of relying only on the linear state estimate (i.e., estimation-centric approach), we use object observations (i.e., the measurements by object detector) to compute a virtual trajectory over the occlusion period to fix the error accumulation of filter parameters during the occlusion period. This allows more time steps to correct errors accumulated during occlusion. We name our method Observation-Centric SORT (OC-SORT). It remains Simple, Online, and Real-Time but improves robustness during occlusion and non-linear motion. Given off-the-shelf detections as input, OC-SORT runs at 700+ FPS on a single CPU. It achieves state-of-the-art results on multiple datasets, including MOT17, MOT20, KITTI, head tracking, and especially DanceTrack where the object motion is highly non-linear. The code and models are available at https://github.com/noahcao/OC_SORT.
Submitted 15 March, 2023; v1 submitted 27 March, 2022;
originally announced March 2022.
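The OC-SORT abstract above mentions computing a virtual trajectory from object observations over an occlusion period. The abstract does not say how that trajectory is built, so the following is only a minimal sketch that assumes plain linear interpolation between the last observation before the occlusion and the first one after it. The function name `virtual_trajectory` and the `(x_center, y_center, width, height)` box layout are hypothetical and chosen for illustration; this is not taken from the authors' code.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x_center, y_center, width, height).
Box = Tuple[float, float, float, float]


def virtual_trajectory(last_obs: Box, new_obs: Box, gap: int) -> List[Box]:
    """Return `gap` interpolated boxes for the frames strictly between two observations.

    Assumption: the object moved along a straight line during the occlusion.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    steps = gap + 1  # number of intervals between the two real observations
    boxes: List[Box] = []
    for k in range(1, steps):
        t = k / steps  # fraction of the way from last_obs to new_obs
        boxes.append(tuple(a + t * (b - a) for a, b in zip(last_obs, new_obs)))
    return boxes


if __name__ == "__main__":
    # Object seen at frame 10, lost for 3 frames, seen again at frame 14.
    for box in virtual_trajectory((0.0, 0.0, 2.0, 2.0), (4.0, 8.0, 2.0, 2.0), gap=3):
        print(box)
```

Under this assumption, the interpolated boxes could be replayed in order through a Kalman filter's update step once the track is re-associated, so the filter state is corrected with observation-like inputs instead of relying only on its own predictions.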
-
Deep Multi-Branch Aggregation Network for Real-Time Semantic Segmentation in Street Scenes
Authors:
Xi Weng,
Yan Yan,
Genshun Dong,
Chang Shu,
Biao Wang,
Hanzi Wang,
Ji Zhang
Abstract:
Real-time semantic segmentation, which aims to achieve high segmentation accuracy at real-time inference speed, has received substantial attention over the past few years. However, many state-of-the-art real-time semantic segmentation methods tend to sacrifice some spatial details or contextual information for fast inference, thus leading to degradation in segmentation quality. In this paper, we propose a novel Deep Multi-branch Aggregation Network (called DMA-Net) based on the encoder-decoder structure to perform real-time semantic segmentation in street scenes. Specifically, we first adopt ResNet-18 as the encoder to efficiently generate various levels of feature maps from different stages of convolutions. Then, we develop a Multi-branch Aggregation Network (MAN) as the decoder to effectively aggregate different levels of feature maps and capture the multi-scale information. In MAN, a lattice enhanced residual block is designed to enhance feature representations of the network by taking advantage of the lattice structure. Meanwhile, a feature transformation block is introduced to explicitly transform the feature map from the neighboring branch before feature aggregation. Moreover, a global context block is used to exploit the global contextual information. These key components are tightly combined and jointly optimized in a unified network. Extensive experimental results on the challenging Cityscapes and CamVid datasets demonstrate that our proposed DMA-Net respectively obtains 77.0% and 73.6% mean Intersection over Union (mIoU) at the inference speed of 46.7 FPS and 119.8 FPS by only using a single NVIDIA GTX 1080Ti GPU. This shows that DMA-Net provides a good tradeoff between segmentation quality and speed for semantic segmentation in street scenes.
Submitted 8 March, 2022;
originally announced March 2022.
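The segmentation entries in this listing report mean Intersection over Union (mIoU). As a reminder of what that figure measures, here is a small sketch that computes mIoU from a confusion matrix. The function name `mean_iou` and the convention that rows are ground-truth classes and columns are predicted classes are assumptions made for this example; benchmark toolkits may also handle ignored labels differently.

```python
from typing import List


def mean_iou(confusion: List[List[int]]) -> float:
    """Mean IoU over classes from a square confusion matrix.

    Assumed convention: confusion[g][p] counts pixels of ground-truth class g
    predicted as class p. Classes with no pixels at all are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c pixels predicted as something else
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Two classes, 8 pixels, 2 of them misclassified.
    print(mean_iou([[3, 1], [1, 3]]))
```

Per-class IoU is TP / (TP + FP + FN), so a class is penalized both for pixels it misses and for pixels it wrongly claims; averaging over classes keeps small classes from being swamped by large ones.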
-
Stage-Aware Feature Alignment Network for Real-Time Semantic Segmentation of Street Scenes
Authors:
Xi Weng,
Yan Yan,
Si Chen,
Jing-Hao Xue,
Hanzi Wang
Abstract:
Over the past few years, deep convolutional neural network-based methods have made great progress in semantic segmentation of street scenes. Some recent methods align feature maps to alleviate the semantic gap between them and achieve high segmentation accuracy. However, they usually adopt the feature alignment modules with the same network configuration in the decoder and thus ignore the different roles of stages of the decoder during feature aggregation, leading to a complex decoder structure. Such a manner greatly affects the inference speed. In this paper, we present a novel Stage-aware Feature Alignment Network (SFANet) based on the encoder-decoder structure for real-time semantic segmentation of street scenes. Specifically, a Stage-aware Feature Alignment module (SFA) is proposed to align and aggregate two adjacent levels of feature maps effectively. In the SFA, by taking into account the unique role of each stage in the decoder, a novel stage-aware Feature Enhancement Block (FEB) is designed to enhance spatial details and contextual information of feature maps from the encoder. In this way, we are able to address the misalignment problem with a very simple and efficient multi-branch decoder structure. Moreover, an auxiliary training strategy is developed to explicitly alleviate the multi-scale object problem without bringing additional computational costs during the inference phase. Experimental results show that the proposed SFANet exhibits a good balance between accuracy and speed for real-time semantic segmentation of street scenes. In particular, based on ResNet-18, SFANet respectively obtains 78.1% and 74.7% mean of class-wise Intersection-over-Union (mIoU) at inference speeds of 37 FPS and 96 FPS on the challenging Cityscapes and CamVid test datasets by using only a single GTX 1080Ti GPU.
Submitted 8 March, 2022;
originally announced March 2022.
-
Property unification of inherent amplitude, phase and polarization within a light beam
Authors:
Xiaoyu Weng,
Yu Miao,
Guanxue Wang,
Yihui Wang,
Qiufang Zhan,
Xiangmei Dong,
Junle Qu,
Xiumin Gao,
Songlin Zhuang
Abstract:
Is it possible to modulate the inherent properties of a single light beam, namely amplitude, phase and polarization, simultaneously, by merely its phase? Here, we solve this scientific problem by unifying all three of these properties of a single light beam using phase vectorization and a phase version of Malus's law. A full-property spatial light modulator is therefore developed based on the unification of these fundamental links, which enables pixel-level polarization, amplitude and phase manipulation of light beams in a real-time dynamic way. This work not only implies that the amplitude, phase and polarization of a single light beam are interconnected, but also offers a solid answer on how to modulate these three properties of a single light beam simultaneously, which will deepen our understanding of the behavior of light beams and facilitate extensive developments in optics and related fields.
Submitted 3 January, 2022;
originally announced January 2022.
-
Unusual phase transition of layer-stacked borophene under pressure
Authors:
Xiao-Ji Weng,
QuanSheng Wu,
Xi Shao,
Oleg V. Yazyev,
Xin-Ling He,
Xiao Dong,
Hui-Tian Wang,
Xiang-Feng Zhou,
Yongjun Tian
Abstract:
The 8-Pmmn borophene, a boron analogue of graphene, hosts tilted and anisotropic massless Dirac fermion quasiparticles owing to the presence of the distorted graphene-like sublattice. First-principles calculations show that the stacked 8-Pmmn borophene is transformed into the fused three-dimensional borophene under pressure, accompanied by partial bond breaking and bond reforming. Strikingly, the fused 8-Pmmn borophene inherits the Dirac band dispersion, resulting in an unusual semimetal-semimetal transition. A simple tight-binding model derived from graphene qualitatively reveals the underlying physics due to the maximum preservation of graphene-like substructure after the phase transition, which contrasts sharply with the transformation of graphite into diamond associated with the semimetal-insulator transition.
Submitted 26 April, 2022; v1 submitted 30 November, 2021;
originally announced November 2021.
-
Optical demultiplexing of fractal-structured beams in turbulent atmospheric environments
Authors:
Xiaojing Weng,
Luat T. Vuong
Abstract:
When information is spatially repeated in self-similar fractal beam patterns, only a portion of the diffracted beam is needed to reconstruct the kernel data. What is unique to a fractal-encoding scheme is that the image demultiplexing process can be, to a first approximation, easily performed optically. In prior work, we experimentally and numerically studied fractal-encoded optical beams and their mid- and far-field propagation without added turbulence. Here, we present preliminary simulations of fractal-encoded beams with high turbulence ($C_n^2 \geq 10^{-14}$ m$^{-2/3}$) where we achieve respectable bit error rates of $10^{-3}$. These results are impressive given that data with low fractal orders is shown, simple threshold algorithms are used (i.e., no machine learning), and only a third of the beam, off-axis, is needed. More robust channel encoding is associated with increased fractal orders, larger collection areas, and higher kernel singular value decomposition entropy.
Submitted 22 January, 2024; v1 submitted 3 November, 2021;
originally announced November 2021.
-
MTP: Multi-Hypothesis Tracking and Prediction for Reduced Error Propagation
Authors:
Xinshuo Weng,
Boris Ivanovic,
Marco Pavone
Abstract:
Recently, there has been tremendous progress in developing each individual module of the standard perception-planning robot autonomy pipeline, including detection, tracking, prediction of other agents' trajectories, and ego-agent trajectory planning. Nevertheless, there has been less attention given to the principled integration of these components, particularly in terms of the characterization and mitigation of cascading errors. This paper addresses the problem of cascading errors by focusing on the coupling between the tracking and prediction modules. First, by using state-of-the-art tracking and prediction tools, we conduct a comprehensive experimental evaluation of how severely errors stemming from tracking can impact prediction performance. On the KITTI and nuScenes datasets, we find that predictions consuming tracked trajectories as inputs (the typical case in practice) can experience a significant (even order of magnitude) drop in performance in comparison to the idealized setting where ground truth past trajectories are used as inputs. To address this issue, we propose a multi-hypothesis tracking and prediction framework. Rather than relying on a single set of tracking results for prediction, our framework simultaneously reasons about multiple sets of tracking results, thereby increasing the likelihood of including accurate tracking results as inputs to prediction. We show that this framework improves overall prediction performance over the standard single-hypothesis tracking-prediction pipeline by up to 34.2% on the nuScenes dataset, with even more significant improvements (up to ~70%) when restricting the evaluation to challenging scenarios involving identity switches and fragments -- all with an acceptable computation overhead.
Submitted 18 October, 2021;
originally announced October 2021.
-
Triply heavy tetraquark states
Authors:
Xin-Zhen Weng,
Wei-Zhen Deng,
Shi-Lin Zhu
Abstract:
In the framework of an extended chromomagnetic model, we systematically study the mass spectrum of the $S$-wave $qQ\bar{Q}\bar{Q}$ tetraquarks. Their mass spectra are mainly determined by the color interaction. For the $qc\bar{c}\bar{c}$, $qb\bar{c}\bar{c}$ and $qb\bar{b}\bar{b}$ tetraquarks, the color interaction favors the color-sextet $\ket{(qQ)^{6_{c}}(\bar{Q}\bar{Q})^{\bar{6}_{c}}}$ configuration over the color-triplet $\ket{(qQ)^{\bar{3}_{c}}(\bar{Q}\bar{Q})^{3_{c}}}$ one. But for the $qc\bar{b}\bar{b}$ tetraquarks, the color-triplet configuration is favored. We find no stable states which lie below the thresholds of two pseudoscalar mesons. The lowest axial-vector states with the $qQ\bar{b}\bar{b}$ flavor configuration may be narrow. They lie just above the thresholds of two pseudoscalar mesons, but cannot decay into these channels because of the conservation of the angular momentum and parity.
Submitted 24 February, 2022; v1 submitted 11 September, 2021;
originally announced September 2021.
-
Doubly heavy tetraquarks in an extended chromomagnetic model
Authors:
Xin-Zhen Weng,
Wei-Zhen Deng,
Shi-Lin Zhu
Abstract:
Using an extended chromomagnetic model, we perform a systematic study of the masses of the doubly heavy tetraquarks. We find that the ground states of the doubly heavy tetraquarks are dominated by color-triplet $\ket{(qq)^{\bar{3}_{c}}(\bar{Q}\bar{Q})^{3_{c}}}$ configuration, which is opposite to that of the fully heavy tetraquarks. The combined results suggest that the color-triplet configuration becomes more important when the mass difference between the quarks and antiquarks increases. We find three stable states which lie below the thresholds of two pseudoscalar mesons. They are the $IJ^{P}=01^{+}$ $nn\bar{b}\bar{b}$ tetraquark, the $IJ^{P}=00^{+}$ $nn\bar{c}\bar{b}$ tetraquark and the $J^{P}=1^{+}$ $ns\bar{b}\bar{b}$ tetraquark.
Submitted 5 October, 2021; v1 submitted 16 August, 2021;
originally announced August 2021.
-
Multi-Echo LiDAR for 3D Object Detection
Authors:
Yunze Man,
Xinshuo Weng,
Prasanna Kumar Sivakuma,
Matthew O'Toole,
Kris Kitani
Abstract:
LiDAR sensors can be used to obtain a wide range of measurement signals other than a simple 3D point cloud, and those signals can be leveraged to improve perception tasks like 3D object detection. A single laser pulse can be partially reflected by multiple objects along its path, resulting in multiple measurements called echoes. Multi-echo measurement can provide information about object contours and semi-transparent surfaces which can be used to better identify and locate objects. LiDAR can also measure surface reflectance (intensity of laser pulse return), as well as ambient light of the scene (sunlight reflected by objects). These signals are already available in commercial LiDAR devices but have not been used in most LiDAR-based detection models. We present a 3D object detection model which leverages the full spectrum of measurement signals provided by LiDAR. First, we propose a multi-signal fusion (MSF) module to combine (1) the reflectance and ambient features extracted with a 2D CNN, and (2) point cloud features extracted using a 3D graph neural network (GNN). Second, we propose a multi-echo aggregation (MEA) module to combine the information encoded in different sets of echo points. Compared with traditional single-echo point cloud methods, our proposed Multi-Signal LiDAR Detector (MSLiD) extracts richer context information from a wider range of sensing measurements and achieves more accurate 3D object detection. Experiments show that by incorporating the multi-modality of LiDAR, our method outperforms the state-of-the-art by up to 9.1%.
Submitted 23 July, 2021;
originally announced July 2021.
-
Multi-Modality Task Cascade for 3D Object Detection
Authors:
Jinhyung Park,
Xinshuo Weng,
Yunze Man,
Kris Kitani
Abstract:
Point clouds and RGB images are naturally complementary modalities for 3D visual understanding - the former provides sparse but accurate locations of points on objects, while the latter contains dense color and texture information. Despite this potential for close sensor fusion, many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data. This separated training scheme results in potentially sub-optimal performance and prevents 3D tasks from being used to benefit 2D tasks that are often useful on their own. To provide a more integrated approach, we propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions, which are then used to further refine the 3D boxes. We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance. Moreover, to prevent the 3D module from over-relying on the overfitted 2D predictions, we propose a dual-head 2D segmentation training and inference scheme, allowing the second 3D module to learn to interpret imperfect 2D segmentation predictions. Evaluating our model on the challenging SUN RGB-D dataset, we improve upon state-of-the-art results of both single modality and fusion networks by a large margin (+3.8 mAP@0.5). Code will be released at https://github.com/Divadi/MTC_RCNN.
Submitted 8 July, 2021;
originally announced July 2021.
-
Light beam carrying natural non-integer orbital angular momentum in free space
Authors:
Xiaoyu Weng,
Yu Miao,
Guanxue Wang,
Qiufang Zhan,
Xiangmei Dong,
Junle Qu,
Xiumin Gao,
Songlin Zhuang
Abstract:
Light beams with optical vortices can propagate in free space only with integer orbital angular momentum. Here, we invert this scientific consensus theoretically and experimentally by proposing light beams carrying natural non-integer orbital angular momentum. These peculiar light beams are actually special solutions of wave function, which possess optical vortices with the topological charge l+0.5, where l is an integer. Owing to the interaction of phase and polarization singularity, these vortex beams with fractional topological charge can maintain their amplitude and vortex phase even when they propagate to an infinite distance. This work demonstrates another state of optical vortices in free space, which will fundamentally inject new vigor into optics and other related scientific fields.
Submitted 24 May, 2021;
originally announced May 2021.
-
Wide-Baseline Multi-Camera Calibration using Person Re-Identification
Authors:
Yan Xu,
Yu-Jhe Li,
Xinshuo Weng,
Kris Kitani
Abstract:
We address the problem of estimating the 3D pose of a network of cameras for large-environment wide-baseline scenarios, e.g., cameras for construction sites, sports stadiums, and public spaces. This task is challenging since detecting and matching the same 3D keypoint observed from two very different camera views is difficult, making standard structure-from-motion (SfM) pipelines inapplicable. In such circumstances, treating people in the scene as "keypoints" and associating them across different camera views can be an alternative method for obtaining correspondences. Based on this intuition, we propose a method that uses ideas from person re-identification (re-ID) for wide-baseline camera calibration. Our method first employs a re-ID method to associate human bounding boxes across cameras, then converts bounding box correspondences to point correspondences, and finally solves for camera pose using multi-view geometry and bundle adjustment. Since our method does not require specialized calibration targets except for visible people, it applies to situations where frequent calibration updates are required. We perform extensive experiments on datasets captured from scenes of different sizes, camera settings (indoor and outdoor), and human activities (walking, playing basketball, construction). Experiment results show that our method achieves similar performance to standard SfM methods relying on manually labeled point correspondences.
Submitted 17 April, 2021;
originally announced April 2021.
-
AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting
Authors:
Ye Yuan,
Xinshuo Weng,
Yanglan Ou,
Kris Kitani
Abstract:
Predicting accurate future trajectories of multiple agents is essential for autonomous systems, but is challenging due to the complex agent interaction and the uncertainty in each agent's future behavior. Forecasting multi-agent trajectories requires modeling two key dimensions: (1) time dimension, where we model the influence of past agent states over future states; (2) social dimension, where we model how the state of each agent affects others. Most prior methods model these two dimensions separately, e.g., first using a temporal model to summarize features over time for each agent independently and then modeling the interaction of the summarized features with a social model. This approach is suboptimal since independent feature encoding over either the time or social dimension can result in a loss of information. Instead, we would prefer a method that allows an agent's state at one time to directly affect another agent's state at a future time. To this end, we propose a new Transformer, AgentFormer, that jointly models the time and social dimensions. The model leverages a sequence representation of multi-agent trajectories by flattening trajectory features across time and agents. Since standard attention operations disregard the agent identity of each element in the sequence, AgentFormer uses a novel agent-aware attention mechanism that preserves agent identities by attending to elements of the same agent differently than elements of other agents. Based on AgentFormer, we propose a stochastic multi-agent trajectory prediction model that can attend to features of any agent at any previous timestep when inferring an agent's future position. The latent intent of all agents is also jointly modeled, allowing the stochasticity in one agent's behavior to affect other agents. Our method substantially improves the state of the art on well-established pedestrian and autonomous driving datasets.
Submitted 7 October, 2021; v1 submitted 25 March, 2021;
originally announced March 2021.
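The AgentFormer abstract above describes flattening multi-agent trajectory features across time and agents, then attending to elements of the same agent differently from elements of other agents. The sketch below only illustrates the bookkeeping this implies: a flattened sequence index and a same-agent mask. The time-major ordering and the names `flatten_index` and `same_agent_mask` are assumptions for illustration; how the mask is combined with attention weights is not stated in the abstract and is left out here.

```python
from typing import List, Tuple


def flatten_index(num_steps: int, num_agents: int) -> List[Tuple[int, int]]:
    """Map each position in the flattened sequence to (timestep, agent_id).

    Assumed ordering: time-major, i.e. all agents at t=0, then all agents at t=1, ...
    """
    return [(t, a) for t in range(num_steps) for a in range(num_agents)]


def same_agent_mask(num_steps: int, num_agents: int) -> List[List[int]]:
    """mask[i][j] is 1 when sequence elements i and j belong to the same agent, else 0."""
    seq = flatten_index(num_steps, num_agents)
    return [[1 if ai == aj else 0 for (_, aj) in seq] for (_, ai) in seq]


if __name__ == "__main__":
    # 2 timesteps, 2 agents -> sequence of length 4.
    for row in same_agent_mask(2, 2):
        print(row)
```

A mask like this lets a model keep track of agent identity after flattening, for example by selecting between one set of attention scores for same-agent pairs and another for cross-agent pairs.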