arXiv:2406.12095 [pdf, other]

DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

Authors: Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven L. Waslander, Yue Wang, Sanja Fidler, Marco Pavone, Peter Karkus

Abstract: We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in autonomous driving. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets for training, thereby helping our model to learn 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes dataset demonstrate that DistillNeRF significantly outperforms existing comparable self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2403.11492 [pdf, other]

SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction

Authors: Yang Zhou, Hao Shao, Letian Wang, Steven L. Waslander, Hongsheng Li, Yu Liu

Abstract: Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. Context information, such as road maps and surrounding agents' states, provides crucial geometric and semantic information for motion behavior prediction. To this end, recent works explore two-stage prediction frameworks where coarse trajectories are first proposed, and then used to select critical context information for trajectory refinement. However, they either incur a large amount of computation or bring limited improvement, if not both. In this paper, we introduce a novel scenario-adaptive refinement strategy, named SmartRefine, to refine prediction with minimal additional computation. Specifically, SmartRefine can comprehensively adapt refinement configurations based on each scenario's properties, and smartly chooses the number of refinement iterations by introducing a quality score to measure the prediction quality and remaining refinement potential of each scenario. SmartRefine is designed as a generic and flexible approach that can be seamlessly integrated into most state-of-the-art motion prediction models. Experiments on Argoverse (1 & 2) show that our method consistently improves the prediction accuracy of multiple state-of-the-art prediction models. Specifically, by adding SmartRefine to QCNet, we outperform all published ensemble-free works on the Argoverse 2 leaderboard (single agent track) at submission. Comprehensive studies are also conducted to ablate design choices and explore the mechanism behind multi-iteration refinement. Codes are available at https://github.com/opendilab/SmartRefine/

Submitted 19 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: Camera-ready version for CVPR 2024

arXiv:2402.12303 [pdf, other]

UncertaintyTrack: Exploiting Detection and Localization Uncertainty in Multi-Object Tracking

Authors: Chang Won Lee, Steven L. Waslander

Abstract: Multi-object tracking (MOT) methods have seen a significant boost in performance recently, due to strong interest from the research community and steadily improving object detection methods. The majority of tracking methods follow the tracking-by-detection (TBD) paradigm, blindly trusting the incoming detections with no sense of their associated localization uncertainty. This lack of uncertainty awareness poses a problem in safety-critical tasks such as autonomous driving where passengers could be put at risk due to erroneous detections that have propagated to downstream tasks, including MOT. While there are existing works in probabilistic object detection that predict the localization uncertainty around the boxes, no work in 2D MOT for autonomous driving has studied whether these estimates are meaningful enough to be leveraged effectively in object tracking. We introduce UncertaintyTrack, a collection of extensions that can be applied to multiple TBD trackers to account for localization uncertainty estimates from probabilistic object detectors. Experiments on the Berkeley Deep Drive MOT dataset show that the combination of our method and informative uncertainty estimates reduces the number of ID switches by around 19% and improves mMOTA by 2-3%. The source code is available at https://github.com/TRAILab/UncertaintyTrack

Submitted 29 April, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: Accepted to ICRA 2024

arXiv:2402.06537 [pdf, other]

Feature Density Estimation for Out-of-Distribution Detection via Normalizing Flows

Authors: Evan D. Cook, Marc-Antoine Lavoie, Steven L. Waslander

Abstract: Out-of-distribution (OOD) detection is a critical task for safe deployment of learning systems in the open world setting. In this work, we investigate the use of feature density estimation via normalizing flows for OOD detection and present a fully unsupervised approach which requires no exposure to OOD data, avoiding researcher bias in OOD sample selection. This is a post-hoc method which can be applied to any pretrained model, and involves training a lightweight auxiliary normalizing flow model to perform the out-of-distribution detection via density thresholding. Experiments on OOD detection in image classification show strong results for far-OOD data detection with only a single epoch of flow training, including 98.2% AUROC for ImageNet-1k vs. Textures, which exceeds the state of the art by 7.8%. We additionally explore the connection between the feature space distribution of the pretrained model and the performance of our method. Finally, we provide insights into training pitfalls that have plagued normalizing flows for use in OOD detection.

Submitted 29 April, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

Comments: Accepted to CRV 2024
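
Illustrative note on the entry above: the density-thresholding step can be sketched as follows. This is a minimal sketch, not the authors' implementation. It swaps the normalizing flow for a diagonal Gaussian fitted to in-distribution features, purely to keep the example dependency-free; the function names and the quantile-based threshold are assumptions for illustration.

```python
import math


def fit_diag_gaussian(features):
    """Fit a per-dimension Gaussian to in-distribution feature vectors.

    Stand-in for the flow: any model that returns a log-density would do.
    """
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    # Floor the variance so the log-density stays finite.
    var = [max(sum((f[j] - mean[j]) ** 2 for f in features) / n, 1e-6)
           for j in range(d)]
    return mean, var


def log_density(x, mean, var):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def threshold_at_quantile(scores, q):
    """Pick the q-quantile of in-distribution scores as the cut-off (assumed rule)."""
    ordered = sorted(scores)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]


def is_ood(x, mean, var, threshold):
    """Flag a feature vector as OOD when its log-density falls below the cut-off."""
    return log_density(x, mean, var) < threshold


# Usage: fit on in-distribution features only, then threshold new samples.
train = [[0.0], [1.0], [-1.0], [0.5], [-0.5]]
mean, var = fit_diag_gaussian(train)
scores = [log_density(f, mean, var) for f in train]
cutoff = threshold_at_quantile(scores, 0.0)
```

No OOD data is needed to set the cut-off, which mirrors the unsupervised setting the abstract describes.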

arXiv:2312.07488 [pdf, other]

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

Authors: Hao Shao, Yuxuan Hu, Letian Wang, Steven L. Waslander, Yu Liu, Hongsheng Li

Abstract: Despite significant recent progress in the field of autonomous driving, modern methods still struggle and can incur serious accidents when encountering long-tail unforeseen events and challenging urban scenarios. On the one hand, large language models (LLMs) have shown impressive reasoning capabilities that approach "Artificial General Intelligence". On the other hand, previous autonomous driving methods tend to rely on limited-format inputs (e.g. sensor data and navigation waypoints), restricting the vehicle's ability to understand language information and interact with humans. To this end, this paper introduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous driving framework. LMDrive uniquely processes and integrates multi-modal sensor data with natural language instructions, enabling interaction with humans and navigation software in realistic instructional settings. To facilitate further research in language-based closed-loop autonomous driving, we also publicly release the corresponding dataset which includes approximately 64K instruction-following data clips, and the LangAuto benchmark that tests the system's ability to handle complex instructions and challenging driving scenarios. Extensive closed-loop experiments are conducted to demonstrate LMDrive's effectiveness. To the best of our knowledge, this is the very first work to leverage LLMs for closed-loop end-to-end autonomous driving. Codes, models, and datasets can be found at https://github.com/opendilab/LMDrive

Submitted 21 December, 2023; v1 submitted 12 December, 2023; originally announced December 2023.

Comments: project page: https://hao-shao.com/projects/lmdrive.html

arXiv:2311.10983 [pdf, other]

Multiple View Geometry Transformers for 3D Human Pose Estimation

Authors: Ziwei Liao, Jialiang Zhu, Chunyu Wang, Han Hu, Steven L. Waslander

Abstract: In this work, we aim to improve the 3D reasoning ability of Transformers in multi-view 3D human pose estimation. Recent works have focused on end-to-end learning-based transformer designs, which struggle to resolve geometric information accurately, particularly during occlusion. Instead, we propose a novel hybrid model, MVGFormer, which has a series of geometric and appearance modules organized in an iterative manner. The geometry modules are learning-free and handle all viewpoint-dependent 3D tasks geometrically which notably improves the model's generalization ability. The appearance modules are learnable and are dedicated to estimating 2D poses from image signals end-to-end which enables them to achieve accurate estimates even when occlusion occurs, leading to a model that is both accurate and generalizable to new cameras and geometries. We evaluate our approach for both in-domain and out-of-domain settings, where our model consistently outperforms state-of-the-art methods, and especially does so by a significant margin in the out-of-domain setting. We will release the code and models: https://github.com/XunshanMan/MVGFormer.

Submitted 18 November, 2023; originally announced November 2023.

Comments: 14 pages, 8 figures

arXiv:2309.09118 [pdf, other]

Uncertainty-aware 3D Object-Level Mapping with Deep Shape Priors

Authors: Ziwei Liao, Jun Yang, Jingxing Qian, Angela P. Schoellig, Steven L. Waslander

Abstract: 3D object-level mapping is a fundamental problem in robotics, which is especially challenging when object CAD models are unavailable during inference. In this work, we propose a framework that can reconstruct high-quality object-level maps for unknown objects. Our approach takes multiple RGB-D images as input and outputs dense 3D shapes and 9-DoF poses (including 3 scale parameters) for detected objects. The core idea of our approach is to leverage a learnt generative model for shape categories as a prior and to formulate a probabilistic, uncertainty-aware optimization framework for 3D reconstruction. We derive a probabilistic formulation that propagates shape and pose uncertainty through two novel loss functions. Unlike current state-of-the-art approaches, we explicitly model the uncertainty of the object shapes and poses during our optimization, resulting in a high-quality object-level mapping system. Moreover, the resulting shape and pose uncertainties, which we demonstrate can accurately reflect the true errors of our object maps, can also be useful for downstream robotics tasks such as active vision. We perform extensive evaluations on indoor and outdoor real-world datasets, achieving substantial improvements over state-of-the-art methods. Our code will be available at https://github.com/TRAILab/UncertainShapePose.

Submitted 16 September, 2023; originally announced September 2023.

Comments: Manuscript submitted to ICRA 2024

arXiv:2308.14665 [pdf, other]

Active Pose Refinement for Textureless Shiny Objects using the Structured Light Camera

Authors: Jun Yang, Jian Yao, Steven L. Waslander

Abstract: 6D pose estimation of textureless shiny objects has become an essential problem in many robotic applications. Many pose estimators require high-quality depth data, often measured by structured light cameras. However, when objects have shiny surfaces (e.g., metal parts), these cameras fail to sense complete depths from a single viewpoint due to the specular reflection, resulting in a significant drop in the final pose accuracy. To mitigate this issue, we present a complete active vision framework for 6D object pose refinement and next-best-view prediction. Specifically, we first develop an optimization-based pose refinement module for the structured light camera. Our system then selects the next best camera viewpoint to collect depth measurements by minimizing the predicted uncertainty of the object pose. Compared to previous approaches, we additionally predict measurement uncertainties of future viewpoints by online rendering, which significantly improves the next-best-view prediction performance. We test our approach on the challenging real-world ROBI dataset. The results demonstrate that our pose refinement method outperforms the traditional ICP-based approach when given the same input depth data, and our next-best-view strategy can achieve high object pose accuracy with significantly fewer viewpoints than the heuristic-based policies.

Submitted 28 August, 2023; originally announced August 2023.

arXiv:2307.00488 [pdf, other]

POV-SLAM: Probabilistic Object-Aware Variational SLAM in Semi-Static Environments

Authors: Jingxing Qian, Veronica Chatrath, James Servos, Aaron Mavrinac, Wolfram Burgard, Steven L. Waslander, Angela P. Schoellig

Abstract: Simultaneous localization and mapping (SLAM) in slowly varying scenes is important for long-term robot task completion. Failing to detect scene changes may lead to inaccurate maps and, ultimately, lost robots. Classical SLAM algorithms assume static scenes, and recent works take dynamics into account, but require scene changes to be observed in consecutive frames. Semi-static scenes, wherein objects appear, disappear, or move slowly over time, are often overlooked, yet are critical for long-term operation. We propose an object-aware, factor-graph SLAM framework that tracks and reconstructs semi-static object-level changes. Our novel variational expectation-maximization strategy is used to optimize factor graphs involving a Gaussian-Uniform bimodal measurement likelihood for potentially-changing objects. We evaluate our approach alongside the state-of-the-art SLAM solutions in simulation and on our novel real-world SLAM dataset captured in a warehouse over four months. Our method improves the robustness of localization in the presence of semi-static changes, providing object-level reasoning about the scene.

Submitted 2 July, 2023; originally announced July 2023.

Comments: Published in Robotics: Science and Systems (RSS) 2023

arXiv:2306.11739 [pdf, other]

Multi-view 3D Object Reconstruction and Uncertainty Modelling with Neural Shape Prior

Authors: Ziwei Liao, Steven L. Waslander

Abstract: 3D object reconstruction is important for semantic scene understanding. It is challenging to reconstruct detailed 3D shapes from monocular images directly due to a lack of depth information, occlusion and noise. Most current methods generate deterministic object models without any awareness of the uncertainty of the reconstruction. We tackle this problem by leveraging a neural object representation which learns an object shape distribution from a large dataset of 3D object models and maps it into a latent space. We propose a method to model uncertainty as part of the representation and define an uncertainty-aware encoder which generates latent codes with uncertainty directly from individual input images. Further, we propose a method to propagate the uncertainty in the latent code to SDF values and generate a 3D object mesh with local uncertainty for each mesh component. Finally, we propose an incremental fusion method under a Bayesian framework to fuse the latent codes from multi-view observations. We evaluate the system in both synthetic and real datasets to demonstrate the effectiveness of uncertainty-based fusion to improve 3D object reconstruction accuracy.

Submitted 6 November, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

Comments: Manuscript accepted by WACV 2024
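
Illustrative note on the entry above: incremental Bayesian fusion of latent codes with uncertainty can be sketched as below. This is a minimal sketch under an assumption the abstract does not confirm, namely that each view yields an independent Gaussian estimate with diagonal covariance, so fusion reduces to a precision-weighted average per dimension. All function names are hypothetical.

```python
def fuse_gaussian(mean_a, var_a, mean_b, var_b):
    """Fuse two independent diagonal-Gaussian estimates of the same latent code."""
    fused_mean, fused_var = [], []
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        precision = 1.0 / va + 1.0 / vb          # precisions add
        fused_var.append(1.0 / precision)
        fused_mean.append((ma / va + mb / vb) / precision)
    return fused_mean, fused_var


def fuse_incrementally(observations):
    """Fold a sequence of (mean, variance) pairs into one estimate, view by view."""
    it = iter(observations)
    mean, var = next(it)
    for m, v in it:
        mean, var = fuse_gaussian(mean, var, m, v)
    return mean, var


# Usage: two equally uncertain views average out and halve the variance.
mean, var = fuse_incrementally([([0.0], [1.0]), ([2.0], [1.0])])
```

Under this assumption a confident view (small variance) dominates the fused code, while an uncertain view contributes little.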

arXiv:2305.10507 [pdf, other]

ReasonNet: End-to-End Driving with Temporal and Global Reasoning

Authors: Hao Shao, Letian Wang, Ruobing Chen, Steven L. Waslander, Hongsheng Li, Yu Liu

Abstract: The large-scale deployment of autonomous vehicles is yet to come, and one of the major remaining challenges lies in urban dense traffic scenarios. In such cases, it remains challenging to predict the future evolution of the scene and future behaviors of objects, and to deal with rare adverse events such as the sudden appearance of occluded objects. In this paper, we present ReasonNet, a novel end-to-end driving framework that extensively exploits both temporal and global information of the driving scene. By reasoning on the temporal behavior of objects, our method can effectively process the interactions and relationships among features in different frames. Reasoning about the global information of the scene can also improve overall perception performance and benefit the detection of adverse events, especially the anticipation of potential danger from occluded objects. For comprehensive evaluation on occlusion events, we also publicly release a driving simulation benchmark, DriveOcclusionSim, consisting of diverse occlusion events. We conduct extensive experiments on multiple CARLA benchmarks, where our model outperforms all prior methods, ranking first on the sensor track of the public CARLA Leaderboard.

Submitted 17 May, 2023; originally announced May 2023.

Comments: CVPR 2023

arXiv:2305.04412 [pdf, other]

Efficient Reinforcement Learning for Autonomous Driving with Parameterized Skills and Priors

Authors: Letian Wang, Jie Liu, Hao Shao, Wenshuo Wang, Ruobing Chen, Yu Liu, Steven L. Waslander

Abstract: When autonomous vehicles are deployed on public roads, they will encounter countless and diverse driving situations. Many manually designed driving policies are difficult to scale to the real world. Fortunately, reinforcement learning has shown great success in many tasks by automatic trial and error. However, when it comes to autonomous driving in interactive dense traffic, RL agents either fail to learn reasonable performance or necessitate a large amount of data. Our insight is that when humans learn to drive, they will 1) make decisions over the high-level skill space instead of the low-level control space and 2) leverage expert prior knowledge rather than learning from scratch. Inspired by this, we propose ASAP-RL, an efficient reinforcement learning algorithm for autonomous driving that simultaneously leverages motion skills and expert priors. We first parameterize motion skills, which are diverse enough to cover various complex driving scenarios and situations. A skill parameter inverse recovery method is proposed to convert expert demonstrations from control space to skill space. A simple but effective double initialization technique is proposed to leverage expert priors while bypassing the issue of expert suboptimality and early performance degradation. We validate our proposed method on interactive dense-traffic driving tasks given simple and sparse rewards. Experimental results show that our method can lead to higher learning efficiency and better driving performance relative to previous methods that exploit skills and priors differently. Code is open-sourced to facilitate further research.

Submitted 7 May, 2023; originally announced May 2023.

Comments: Robotics: Science and Systems (RSS 2023)

arXiv:2304.14460 [pdf, other]

Gradient-based Maximally Interfered Retrieval for Domain Incremental 3D Object Detection

Authors: Barza Nisar, Hruday Vishal Kanna Anand, Steven L. Waslander

Abstract: Accurate 3D object detection in all weather conditions remains a key challenge to enable the widespread deployment of autonomous vehicles, as most work to date has been performed on clear weather data. In order to generalize to adverse weather conditions, supervised methods perform best if trained from scratch on all weather data instead of finetuning a model pretrained on clear weather data. Training from scratch on all data will eventually become computationally infeasible and expensive as datasets continue to grow and encompass the full extent of possible weather conditions. On the other hand, naive finetuning on data from a different weather domain can result in catastrophic forgetting of the previously learned domain. Inspired by the success of replay-based continual learning methods, we propose Gradient-based Maximally Interfered Retrieval (GMIR), a gradient-based sampling strategy for replay. During finetuning, GMIR periodically retrieves samples from the previous domain dataset whose gradient vectors show maximal interference with the gradient vector of the current update. Our 3D object detection experiments on the SeeingThroughFog (STF) dataset show that GMIR not only overcomes forgetting but also offers competitive performance compared to scratch training on all data with a 46.25% reduction in total training time.

Submitted 3 May, 2023; v1 submitted 27 April, 2023; originally announced April 2023.
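
Illustrative note on the entry above: the retrieval step can be sketched as ranking stored previous-domain samples by how strongly their gradients oppose the current update. This is a minimal sketch, not the authors' code. Measuring interference as low (most negative) cosine similarity is an assumption, since the abstract does not name the metric, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_max_interfering(replay_grads, current_grad, k):
    """Return indices of the k replay samples whose gradients conflict most.

    Assumed criterion: lowest cosine similarity with the current update gradient.
    """
    order = sorted(range(len(replay_grads)),
                   key=lambda i: cosine(replay_grads[i], current_grad))
    return order[:k]


# Usage: sample 1 points directly against the current update, so it is picked first.
replay = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
picked = retrieve_max_interfering(replay, [1.0, 0.0], k=2)
```

The selected samples would then be mixed into the finetuning batches as replay data.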

arXiv:2304.14446 [pdf, other]

HyperMODEST: Self-Supervised 3D Object Detection with Confidence Score Filtering

Authors: Jenny Xu, Steven L. Waslander

Abstract: Current LiDAR-based 3D object detectors for autonomous driving are almost entirely trained on human-annotated data collected in specific geographical domains with specific sensor setups, making it difficult to adapt to a different domain. MODEST is the first work to train 3D object detectors without any labels. Our work, HyperMODEST, proposes a universal method implemented on top of MODEST that can largely accelerate the self-training process and does not require tuning on a specific dataset. We filter intermediate pseudo-labels used for data augmentation with low confidence scores. On the nuScenes dataset, we observe a significant improvement of 1.6% in AP BEV in 0-80m range at IoU=0.25 and an improvement of 1.7% in AP BEV in 0-80m range at IoU=0.5 while only using one-fifth of the training time in the original approach by MODEST. On the Lyft dataset, we also observe an improvement over the baseline during the first round of iterative self-training. We explore the trade-off between high precision and high recall in the early stage of the self-training process by comparing our proposed method with two other score filtering methods: confidence score filtering for pseudo-labels with and without static label retention. The code and models of this work are available at https://github.com/TRAILab/HyperMODEST

Submitted 1 June, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: Accepted in CRV (Conference on Robots and Vision) 2023

arXiv:2303.10729 [pdf, other]

A Target-Based Extrinsic Calibration Framework for Non-Overlapping Camera-Lidar Systems Using a Motion Capture System

Authors: Nicholas Charron, Steven L. Waslander, Sriram Narasimhan

Abstract: In this work, we present a novel target-based lidar-camera extrinsic calibration methodology that can be used for non-overlapping field of view (FOV) sensors. Contrary to previous work, our methodology overcomes the non-overlapping FOV challenge using a motion capture system (MCS) instead of traditional simultaneous localization and mapping approaches. Due to the high relative precision of the MCS, our methodology can achieve both the high accuracy and repeatable calibrations of traditional target-based methods, regardless of the amount of overlap in the field of view of the sensors. We show using simulation that we can accurately recover extrinsic calibrations for a range of perturbations to the true calibration that would be expected in real circumstances. We also validate that high accuracy calibrations can be achieved on experimental data. Furthermore, we implement the described approach in an extensible way that allows any camera model, target shape, or feature extraction methodology to be used within our framework. We validate this implementation on two target shapes: an easy to construct cylinder target and a diamond target with a checkerboard. The cylinder target shape results show that our methodology can be used for degenerate target shapes where target poses cannot be fully constrained from a single observation, and distinct repeatable features need not be detected on the target.

Submitted 14 June, 2023; v1 submitted 19 March, 2023; originally announced March 2023.

Comments: 8 pages, 15 figures

arXiv:2303.06766 [pdf, other]

Next-Best-View Selection for Robot Eye-in-Hand Calibration

Authors: Jun Yang, Jason Rebello, Steven L. Waslander

Abstract: Robotic eye-in-hand calibration is the task of determining the rigid 6-DoF pose of the camera with respect to the robot end-effector frame. In this paper, we formulate this task as a non-linear optimization problem and introduce an active vision approach to strategically select the robot pose for maximizing calibration accuracy. Specifically, given an initial collection of measurement sets, our system first computes the calibration parameters and estimates the parameter uncertainties. We then predict the next robot pose from which to collect the next measurement that brings about the maximum information gain (uncertainty reduction) in the calibration parameters. We test our approach on a simulated dataset and validate the results on a real 6-axis robot manipulator. The results demonstrate that our approach can achieve accurate calibrations using many fewer viewpoints than other commonly used baseline calibration methods.

Submitted 12 March, 2023; originally announced March 2023.

arXiv:2301.05709 [pdf, other]

Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss

Authors: Anas Mahmoud, Jordan S. K. Hu, Tianshu Kuai, Ali Harakeh, Liam Paull, Steven L. Waslander

Abstract: An effective framework for learning 3D representations for perception tasks is distilling rich self-supervised image features via contrastive learning. However, image-to-point representation learning for autonomous driving datasets faces two main challenges: 1) the abundance of self-similarity, which results in the contrastive losses pushing away semantically similar point and image regions and thus disturbing the local semantic structure of the learned representations, and 2) severe class imbalance as pretraining gets dominated by over-represented classes. We propose to alleviate the self-similarity problem through a novel semantically tolerant image-to-point contrastive loss that takes into consideration the semantic distance between positive and negative image regions to minimize contrasting semantically similar point and image regions. Additionally, we address class imbalance by designing a class-agnostic balanced loss that approximates the degree of class imbalance through an aggregate sample-to-samples semantic similarity measure. We demonstrate that our semantically-tolerant contrastive loss with class balancing improves state-of-the-art 2D-to-3D representation learning in all evaluation settings on 3D semantic segmentation. Our method consistently outperforms state-of-the-art 2D-to-3D representation learning frameworks across a wide range of 2D self-supervised pretrained models.

Submitted 24 March, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

Comments: Accepted in CVPR 2023

arXiv:2211.13724 [pdf, other]

Estimating Regression Predictive Distributions with Sample Networks

Authors: Ali Harakeh, Jordan Hu, Naiqing Guan, Steven L. Waslander, Liam Paull

Abstract: Estimating the uncertainty in deep neural network predictions is crucial for many real-world applications. A common approach to model uncertainty is to choose a parametric distribution and fit the data to it using maximum likelihood estimation. The chosen parametric form can be a poor fit to the data-generating distribution, resulting in unreliable uncertainty estimates. In this work, we propose SampleNet, a flexible and scalable architecture for modeling uncertainty that avoids specifying a parametric form on the output distribution. SampleNets do so by defining an empirical distribution using samples that are learned with the Energy Score and regularized with the Sinkhorn Divergence. SampleNets are shown to fit a wide range of distributions well and to outperform baselines on large-scale real-world regression tasks.

Submitted 24 November, 2022; originally announced November 2022.

Comments: Accepted for publication in AAAI 2023. Example code at: https://samplenet.github.io/
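
Note: the Energy Score mentioned in the abstract above is a proper scoring rule that can be estimated directly from samples. The following is a minimal sketch of the standard sample-based estimator, not the authors' implementation; the function name `energy_score` and the tuple-based data layout are assumptions made for illustration only.

```python
import math


def energy_score(samples, target):
    """Sample-based energy score estimate (lower is better).

    samples: list of equal-length tuples drawn from the predictive distribution.
    target:  tuple holding the observed value.
    Hypothetical helper for illustration; not taken from the SampleNet code.
    """
    m = len(samples)
    # Average distance from each predicted sample to the observed target.
    fit = sum(math.dist(x, target) for x in samples) / m
    # Average pairwise distance among samples (the i == j terms contribute zero).
    spread = sum(math.dist(a, b) for a in samples for b in samples) / (m * m)
    return fit - 0.5 * spread
```

The first term rewards samples that lie close to the observation, while the second term rewards diversity among the samples, so a set of samples cannot lower its score simply by collapsing to a single point unless that point is the observation.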

arXiv:2210.11554 [pdf, other]

6D Pose Estimation for Textureless Objects on RGB Frames using Multi-View Optimization

Authors: Jun Yang, Wenjie Xue, Sahar Ghavidel, Steven L. Waslander

Abstract: 6D pose estimation of textureless objects is a valuable but challenging task for many robotic applications. In this work, we propose a framework to address this challenge using only RGB images acquired from multiple viewpoints. The core idea of our approach is to decouple 6D pose estimation into a sequential two-step process, first estimating the 3D translation and then the 3D rotation of each object. This decoupled formulation first resolves the scale and depth ambiguities in single RGB images, and uses these estimates to accurately identify the object orientation in the second stage, which is greatly simplified with an accurate scale estimate. Moreover, to accommodate the multi-modal distribution present in rotation space, we develop an optimization scheme that explicitly handles object symmetries and counteracts measurement uncertainties. In comparison to the state-of-the-art multi-view approach, we demonstrate that the proposed approach achieves substantial improvements on a challenging 6D pose estimation dataset for textureless objects.

Submitted 21 February, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

arXiv:2208.08041 [pdf, other]

InterTrack: Interaction Transformer for 3D Multi-Object Tracking

Authors: John Willes, Cody Reading, Steven L. Waslander

Abstract: 3D multi-object tracking (MOT) is a key problem for autonomous vehicles, required to perform well-informed motion planning in dynamic environments. Particularly for densely occupied scenes, associating existing tracks to new detections remains challenging as existing systems tend to omit critical contextual information. Our proposed solution, InterTrack, introduces the Interaction Transformer for 3D MOT to generate discriminative object representations for data association. We extract state and shape features for each track and detection, and efficiently aggregate global information via attention. We then perform a learned regression on each track/detection feature pair to estimate affinities, and use a robust two-stage data association and track management approach to produce the final tracks. We validate our approach on the nuScenes 3D MOT benchmark, where we observe significant improvements, particularly on classes with small physical sizes and clustered objects. As of submission, InterTrack ranks 1st in overall AMOTA among methods using CenterPoint detections.

Submitted 6 May, 2023; v1 submitted 16 August, 2022; originally announced August 2022.

Comments: Accepted to CRV 2023

arXiv:2205.01202 [pdf, other]

POCD: Probabilistic Object-Level Change Detection and Volumetric Mapping in Semi-Static Scenes

Authors: Jingxing Qian, Veronica Chatrath, Jun Yang, James Servos, Angela P. Schoellig, Steven L. Waslander

Abstract: Maintaining an up-to-date map to reflect recent changes in the scene is very important, particularly in situations involving repeated traversals by a robot operating in an environment over an extended period. Undetected changes may cause a deterioration in map quality, leading to poor localization, inefficient operations, and lost robots. Volumetric methods, such as truncated signed distance functions (TSDFs), have quickly gained traction due to their real-time production of a dense and detailed map, though map updating in scenes that change over time remains a challenge. We propose a framework that introduces a novel probabilistic object state representation to track object pose changes in semi-static scenes. The representation jointly models a stationarity score and a TSDF change measure for each object. A Bayesian update rule that incorporates both geometric and semantic information is derived to achieve consistent online map maintenance. To extensively evaluate our approach alongside the state-of-the-art, we release a novel real-world dataset in a warehouse environment. We also evaluate on the public ToyCar dataset. Our method outperforms state-of-the-art methods on the reconstruction quality of semi-static environments.

Submitted 15 July, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

Comments: Published in Robotics: Science and Systems (RSS) 2022

arXiv:2203.05662 [pdf, other]

Point Density-Aware Voxels for LiDAR 3D Object Detection

Authors: Jordan S. K. Hu, Tianshu Kuai, Steven L. Waslander

Abstract: LiDAR has become one of the primary 3D object detection sensors in autonomous driving. However, LiDAR's diverging point pattern with increasing distance results in a non-uniformly sampled point cloud ill-suited to discretized volumetric feature extraction. Current methods either rely on voxelized point clouds or use inefficient farthest point sampling to mitigate detrimental effects caused by density variation but largely ignore point density as a feature and its predictable relationship with distance from the LiDAR sensor. Our proposed solution, Point Density-Aware Voxel network (PDV), is an end-to-end two-stage LiDAR 3D object detection architecture that is designed to account for these point density variations. PDV efficiently localizes voxel features from the 3D sparse convolution backbone through voxel point centroids. The spatially localized voxel features are then aggregated through a density-aware RoI grid pooling module using kernel density estimation (KDE) and self-attention with point density positional encoding. Finally, we exploit LiDAR's point density to distance relationship to refine our final bounding box confidences. PDV outperforms all state-of-the-art methods on the Waymo Open Dataset and achieves competitive results on the KITTI dataset. We provide a code release for PDV which is available at https://github.com/TRAILab/PDV.

Submitted 21 March, 2022; v1 submitted 10 March, 2022; originally announced March 2022.

Comments: Accepted in CVPR 2022

arXiv:2203.00871 [pdf, other]

Dense Voxel Fusion for 3D Object Detection

Authors: Anas Mahmoud, Jordan S. K. Hu, Steven L. Waslander

Abstract: Camera and LiDAR sensor modalities provide complementary appearance and geometric information useful for detecting 3D objects for autonomous vehicle applications. However, current end-to-end fusion methods are challenging to train and underperform state-of-the-art LiDAR-only detectors. Sequential fusion methods suffer from a limited number of pixel and point correspondences due to point cloud sparsity, or their performance is strictly capped by the detections of one of the modalities. Our proposed solution, Dense Voxel Fusion (DVF), is a sequential fusion method that generates multi-scale dense voxel feature representations, improving expressiveness in low point density regions. To enhance multi-modal learning, we train directly with projected ground truth 3D bounding box labels, avoiding noisy, detector-specific 2D predictions. Both DVF and the multi-modal training approach can be applied to any voxel-based LiDAR backbone. DVF ranks 3rd among published fusion methods on the KITTI 3D car detection benchmark without introducing additional trainable parameters or requiring stereo images or dense depth labels. In addition, DVF significantly improves 3D vehicle detection performance of voxel-based methods on the Waymo Open Dataset.

Submitted 27 October, 2022; v1 submitted 1 March, 2022; originally announced March 2022.

Comments: Accepted in WACV 2023

arXiv:2202.13263 [pdf, other]

Next-Best-View Prediction for Active Stereo Cameras and Highly Reflective Objects

Authors: Jun Yang, Steven L. Waslander

Abstract: Depth acquisition with the active stereo camera is a challenging task for highly reflective objects. When setup permits, multi-view fusion can provide increased levels of depth completion. However, due to the slow acquisition speed of high-end active stereo cameras, collecting a large number of viewpoints for a single scene is generally not practical. In this work, we propose a next-best-view framework to strategically select camera viewpoints for completing depth data on reflective objects. In particular, we explicitly model the specular reflection of reflective surfaces based on the Phong reflection model and a photometric response function. Given the object CAD model and grayscale image, we employ an RGB-based pose estimator to obtain current pose predictions from the existing data, which is used to form predicted surface normal and depth hypotheses, and allows us to then assess the information gain from a subsequent frame for any candidate viewpoint. Using this formulation, we implement an active perception pipeline which is evaluated on a challenging real-world dataset. The evaluation results demonstrate that our active depth acquisition method outperforms two strong baselines for both depth completion and object pose estimation performance.

Submitted 26 February, 2022; originally announced February 2022.

arXiv:2112.00050 [pdf, other]

doi 10.1109/ITSC48978.2021.9564842

Pattern-Aware Data Augmentation for LiDAR 3D Object Detection

Authors: Jordan S. K. Hu, Steven L. Waslander

Abstract: Autonomous driving datasets are often skewed and, in particular, lack training data for objects at farther distances from the ego vehicle. The imbalance of data causes a performance degradation as the distance of the detected objects increases. In this paper, we propose pattern-aware ground truth sampling, a data augmentation technique that downsamples an object's point cloud based on the LiDAR's characteristics. Specifically, we mimic the natural diverging point pattern variation that occurs for objects at depth to simulate samples at farther distances. Thus, the network has more diverse training examples and can generalize to detecting farther objects more effectively. We evaluate against existing data augmentation techniques that use point removal or perturbation methods and find that our method outperforms all of them. Additionally, we propose using equal element AP bins to evaluate the performance of 3D object detectors across distance. We improve the performance of PV-RCNN on the car class by more than 0.7 percent on the KITTI validation split at distances greater than 25 m.

Submitted 30 November, 2021; originally announced December 2021.

Comments: Published paper in the IEEE Intelligent Transportation Systems Conference - ITSC 2021

Journal ref: 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2021, pp. 2703-2710

arXiv:2110.04182 [pdf, other]

Temporal Convolutions for Multi-Step Quadrotor Motion Prediction

Authors: Samuel Looper, Steven L. Waslander

Abstract: Model-based control methods for robotic systems such as quadrotors, autonomous driving vehicles and flexible manipulators require motion models that generate accurate predictions of complex nonlinear system dynamics over long periods of time. Temporal Convolutional Networks (TCNs) can be adapted to this challenge by formulating multi-step prediction as a sequence-to-sequence modeling problem. We present End2End-TCN: a fully convolutional architecture that integrates future control inputs to compute multi-step motion predictions in one forward pass. We demonstrate the approach with a thorough analysis of TCN performance for the quadrotor modeling task, which includes an investigation of scaling effects and ablation studies. Ultimately, End2End-TCN provides 55% error reduction over the state of the art in multi-step prediction on an aggressive indoor quadrotor flight dataset. The model yields accurate predictions across 90 timestep horizons over a 900 ms interval.

Submitted 8 October, 2021; originally announced October 2021.

arXiv:2105.04112 [pdf, other]

ROBI: A Multi-View Dataset for Reflective Objects in Robotic Bin-Picking

Authors: Jun Yang, Yizhou Gao, Dong Li, Steven L. Waslander

Abstract: In robotic bin-picking applications, the perception of texture-less, highly reflective parts is a valuable but challenging task. The high glossiness can introduce fake edges in RGB images and inaccurate depth measurements, especially in heavily cluttered bin scenarios. In this paper, we present the ROBI (Reflective Objects in BIns) dataset, a public dataset for 6D object pose estimation and multi-view depth fusion in robotic bin-picking scenarios. The ROBI dataset includes a total of 63 bin-picking scenes captured with two active stereo cameras: a high-cost Ensenso sensor and a low-cost RealSense sensor. For each scene, the monochrome/RGB images and depth maps are captured from sampled view spheres around the scene, and are annotated with accurate 6D poses of visible objects and an associated visibility score. For evaluating the performance of depth fusion, we captured the ground truth depth maps with the high-cost Ensenso camera with objects coated in anti-reflective scanning spray. To show the utility of the dataset, we evaluated the representative algorithms of 6D object pose estimation and multi-view depth fusion on the full dataset. Evaluation results demonstrate the difficulty of highly reflective objects, especially in difficult cases due to the degradation of depth data quality, severe occlusions and cluttered scenes. The ROBI dataset is available online at https://www.trailab.utias.utoronto.ca/robi.

Submitted 6 October, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

arXiv:2103.10968 [pdf, other]

Probabilistic Multi-View Fusion of Active Stereo Depth Maps for Robotic Bin-Picking

Authors: Jun Yang, Dong Li, Steven L. Waslander

Abstract: The reliable fusion of depth maps from multiple viewpoints has become an important problem in many 3D reconstruction pipelines. In this work, we investigate its impact on robotic bin-picking tasks such as 6D object pose estimation. The performance of object pose estimation relies heavily on the quality of depth data. However, due to the prevalence of shiny surfaces and cluttered scenes, industrial-grade depth cameras often fail to sense depth or generate unreliable measurements from a single viewpoint. To this end, we propose a novel probabilistic framework for scene reconstruction in robotic bin-picking. Based on active stereo camera data, we first explicitly estimate the uncertainty of depth measurements for mitigating the adverse effects of both noise and outliers. The uncertainty estimates are then incorporated into a probabilistic model for incrementally updating the scene. To extensively evaluate the traditional fusion approach alongside our own approach, we will release a novel representative dataset with multiple views for each bin and curated parts. Over the entire dataset, we demonstrate that our framework outperforms a traditional fusion approach by a 12.8% reduction in reconstruction error and a 6.1% improvement in detection rate. The dataset will be available at https://www.trailab.utias.utoronto.ca/robi.

Submitted 19 March, 2021; originally announced March 2021.

arXiv:2103.01100 [pdf, other]

Categorical Depth Distribution Network for Monocular 3D Object Detection

Authors: Cody Reading, Ali Harakeh, Julia Chae, Steven L. Waslander

Abstract: Monocular 3D object detection is a key problem for autonomous vehicles, as it provides a solution with simple configuration compared to typical multi-sensor systems. The main challenge in monocular 3D detection lies in accurately predicting object depth, which must be inferred from object and scene cues due to the lack of direct range measurement. Many methods attempt to directly estimate depth to assist in 3D detection, but show limited performance as a result of depth inaccuracy. Our proposed solution, Categorical Depth Distribution Network (CaDDN), uses a predicted categorical depth distribution for each pixel to project rich contextual feature information to the appropriate depth interval in 3D space. We then use the computationally efficient bird's-eye-view projection and single-stage detector to produce the final output bounding boxes. We design CaDDN as a fully differentiable end-to-end approach for joint depth estimation and object detection. We validate our approach on the KITTI 3D object detection benchmark, where we rank 1st among published monocular methods. We also provide the first monocular 3D detection results on the newly released Waymo Open Dataset. We provide a code release for CaDDN which is made available.

Submitted 23 March, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

Comments: Accepted in CVPR 2021
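
Note: the abstract above describes using a per-pixel categorical depth distribution to place image features at depth intervals in 3D. One common way to realize this idea is to weight a pixel's feature vector by the probability of each depth bin. The sketch below illustrates that weighting for a single pixel under stated assumptions; the function name `lift_pixel` and the list-based layout are hypothetical and not taken from the CaDDN code release.

```python
import math


def lift_pixel(depth_logits, feature):
    """Spread one pixel's feature vector across depth bins.

    depth_logits: raw scores, one per depth bin (assumed network output).
    feature:      the pixel's feature vector (list of floats).
    Returns a list with one weighted copy of `feature` per depth bin.
    """
    # Softmax turns the logits into a categorical distribution over bins.
    peak = max(depth_logits)
    exps = [math.exp(v - peak) for v in depth_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Outer product: each bin receives the feature scaled by its probability.
    return [[p * f for f in feature] for p in probs]
```

Because the bin probabilities sum to one, summing the weighted copies over all bins recovers the original feature vector; a confident depth prediction concentrates the feature in one bin, while an uncertain one spreads it along the camera ray.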

arXiv:2102.04341 [pdf, other]

doi 10.1109/LRA.2021.3058909

Learned Camera Gain and Exposure Control for Improved Visual Feature Detection and Matching

Authors: Justin Tomasi, Brandon Wagstaff, Steven L. Waslander, Jonathan Kelly

Abstract: Successful visual navigation depends upon capturing images that contain sufficient useful information. In this letter, we explore a data-driven approach to account for environmental lighting changes, improving the quality of images for use in visual odometry (VO) or visual simultaneous localization and mapping (SLAM). We train a deep convolutional neural network model to predictively adjust camera gain and exposure time parameters such that consecutive images contain a maximal number of matchable features. The training process is fully self-supervised: our training signal is derived from an underlying VO or SLAM pipeline and, as a result, the model is optimized to perform well with that specific pipeline. We demonstrate through extensive real-world experiments that our network can anticipate and compensate for dramatic lighting changes (e.g., transitions into and out of road tunnels), maintaining a substantially higher number of inlier feature matches than competing camera parameter control algorithms.

Submitted 11 July, 2022; v1 submitted 8 February, 2021; originally announced February 2021.

Comments: In IEEE Robotics and Automation Letters (RA-L) and presented at the IEEE International Conference on Robotics and Automation (ICRA'21), Xi'an, China, May 30-Jun. 5, 2021

Journal ref: IEEE Robotics and Automation Letters (RA-L), Vol. 6, No. 2, pp. 2028-2035, Apr. 2021

arXiv:2101.05036 [pdf, other]

Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors

Authors: Ali Harakeh, Steven L. Waslander

Abstract: Predictive uncertainty estimation is an essential next step for the reliable deployment of deep object detectors in safety-critical tasks. In this work, we focus on estimating predictive distributions for bounding box regression output with variance networks. We show that in the context of object detection, training variance networks with negative log likelihood (NLL) can lead to high entropy predictive distributions regardless of the correctness of the output mean. We propose to use the energy score as a non-local proper scoring rule and find that when used for training, the energy score leads to better calibrated and lower entropy predictive distributions than NLL. We also address the widespread use of non-proper scoring metrics for evaluating predictive distributions from deep object detectors by proposing an alternate evaluation approach founded on proper scoring rules. Using the proposed evaluation tools, we show that although variance networks can be used to produce high quality predictive distributions, ad-hoc approaches used by seminal object detectors for choosing regression targets during training do not provide wide enough data support for reliable variance learning. We hope that our work helps shift evaluation in probabilistic object detection to better align with predictive uncertainty evaluation in other machine learning domains. Code for all models, evaluation, and datasets is available at: https://github.com/asharakeh/probdet.git.

Submitted 12 March, 2021; v1 submitted 13 January, 2021; originally announced January 2021.

Comments: Published as a conference paper at ICLR 2021. Link: https://openreview.net/forum?id=YLewtnvKgR7. This is the final camera-ready version

arXiv:2012.00218 [pdf, ps, other]

Uncertainty-Constrained Differential Dynamic Programming in Belief Space for Vision Based Robots

Authors: Shatil Rahman, Steven L. Waslander

Abstract: Most mobile robots follow a modular sense-plan-act system architecture that can lead to poor performance or even catastrophic failure for visual inertial navigation systems due to trajectories devoid of feature matches. Planning in belief space provides a unified approach to tightly couple the perception, planning and control modules, leading to trajectories that are robust to noisy measurements and disturbances. However, existing methods handle uncertainties as costs that require manual tuning for varying environments and hardware. We therefore propose a novel trajectory optimization formulation that incorporates inequality constraints on uncertainty and a novel Augmented Lagrangian based stochastic differential dynamic programming method in belief space. Furthermore, we develop a probabilistic visibility model that accounts for discontinuities due to feature visibility limits. Our simulation tests demonstrate that our method can handle inequality constraints in different environments, for holonomic and nonholonomic motion models with no manual tuning of uncertainty costs involved. We also show the improved optimization performance in belief space due to our visibility model.

Submitted 30 November, 2020; originally announced December 2020.

Comments: This work has been submitted to the 2021 IEEE International Conference on Robotics and Automation (ICRA) with the Robotics and Automation Letters (RA-L) option for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2009.08577 [pdf]

Making Sense of the Robotized Pandemic Response: A Comparison of Global and Canadian Robot Deployments and Success Factors

Authors: T. Barfoot, J. Burgner-Kahrs, E. Diller, A. Garg, A. Goldenberg, J. Kelly, X. Liu, H. E. Naguib, G. Nejat, A. P. Schoellig, F. Shkurti, H. Siegel, Y. Sun, S. L. Waslander

Abstract: From disinfection and remote triage, to logistics and delivery, countries around the world are making use of robots to address the unique challenges presented by the COVID-19 pandemic. Robots are being used to manage the pandemic in Canada too, but relative to other regions, we have been more cautious in our adoption -- this despite the important role that robots of Canadian origin are now playing on the global stage. This white paper discusses why this is the case, and argues that strategic investment and support for the Canadian robotics industry are urgently needed to bring the benefits of robotics home, where we have more control in shaping the future of this game-changing technology. Such investments will not only serve to support Canada's current pandemic response and post pandemic recovery, but will also prepare this country to weather future crises. Without such support, Canada risks falling behind other developed nations that are investing heavily in hardware automation at this time.

Submitted 21 September, 2020; v1 submitted 17 September, 2020; originally announced September 2020.

Comments: 104 pages, 18 figures, 13 tables. Corresponding Author: H Siegel

arXiv:2003.05505 [pdf, other]

Confidence Guided Stereo 3D Object Detection with Split Depth Estimation

Authors: Chengyao Li, Jason Ku, Steven L. Waslander

Abstract: Accurate and reliable 3D object detection is vital to safe autonomous driving. Despite recent developments, the performance gap between stereo-based methods and LiDAR-based methods is still considerable. Accurate depth estimation is crucial to the performance of stereo-based 3D object detection methods, particularly for those pixels associated with objects in the foreground. Moreover, stereo-based methods suffer from high variance in the depth estimation accuracy, which is often not considered in the object detection pipeline. To tackle these two issues, we propose CG-Stereo, a confidence-guided stereo 3D object detection pipeline that uses separate decoders for foreground and background pixels during depth estimation, and leverages the confidence estimation from the depth estimation network as a soft attention mechanism in the 3D object detector. Our approach outperforms all state-of-the-art stereo-based 3D detectors on the KITTI benchmark.

Submitted 11 March, 2020; originally announced March 2020.

Comments: 8 pages, 6 figures

arXiv:2001.09297 [pdf, other]

Vehicle Scheduling Problem

Authors: Mirmojtaba Gharibi, Steven L. Waslander, Raouf Boutaba

Abstract: We define a new problem called the Vehicle Scheduling Problem (VSP). The goal is to minimize an objective function, such as the number of tardy vehicles, over a transportation network subject to maintaining safety distances, meeting hard deadlines, and maintaining speeds on each link between the allowed minimums and maximums. We prove VSP is an NP-hard problem for multiple objective functions that are commonly used in the context of job shop scheduling. With the number of tardy vehicles as the objective function, we formulate VSP as a Mixed Integer Linear Program (MIP) and design a heuristic algorithm. We analyze the complexity of our algorithm and compare the quality of the solutions to the optimal solution for the MIP formulation in small cases. Our main motivation for defining VSP is the upcoming integration of Unmanned Aerial Vehicles (UAVs) into the airspace, for which this novel scheduling framework is of paramount importance.

Submitted 25 January, 2020; originally announced January 2020.

arXiv:1909.08537 [pdf, other]

Visual Measurement Integrity Monitoring for UAV Localization

Authors: Chengyao Li, Steven L. Waslander

Abstract: Unmanned aerial vehicles (UAVs) have increasingly been adopted for safety, security, and rescue missions, for which they need precise and reliable pose estimates relative to their environment. To ensure mission safety when relying on visual perception, it is essential to have an approach to assess the integrity of the visual localization solution. However, to the best of our knowledge, such an approach does not exist for optimization-based visual localization. Receiver autonomous integrity monitoring (RAIM) has been widely used in global navigation satellite systems (GNSS) applications such as automated aircraft landing. In this paper, we propose a novel approach inspired by RAIM to monitor the integrity of optimization-based visual localization and calculate the protection level of a state estimate, i.e. the largest possible translational error in each direction. We also propose a metric that quantitatively evaluates the performance of the error bounds. Finally, we validate the protection level using the EuRoC dataset and demonstrate that the proposed protection level provides a significantly more reliable bound than the commonly used $3\sigma$ method.

Submitted 18 September, 2019; originally announced September 2019.

Comments: Published in Safety, Security, and Rescue Robotics 2019

arXiv:1909.07566 [pdf, other]

Object-Centric Stereo Matching for 3D Object Detection

Authors: Alex D. Pon, Jason Ku, Chengyao Li, Steven L. Waslander

Abstract: Safe autonomous driving requires reliable 3D object detection: determining the 6 DoF pose and dimensions of objects of interest. Using stereo cameras to solve this task is a cost-effective alternative to the widely used LiDAR sensor. The current state-of-the-art for stereo 3D object detection takes the existing PSMNet stereo matching network, with no modifications, converts the estimated disparities into a 3D point cloud, and feeds this point cloud into a LiDAR-based 3D object detector. The issue with existing stereo matching networks is that they are designed for disparity estimation, not 3D object detection; the shape and accuracy of object point clouds are not the focus. Stereo matching networks commonly suffer from inaccurate depth estimates at object boundaries, which we define as streaking, because background and foreground points are jointly estimated. Existing networks also penalize disparity instead of the estimated position of object point clouds in their loss functions. We propose a novel 2D box association and object-centric stereo matching method that only estimates the disparities of the objects of interest to address these two issues. Our method achieves state-of-the-art results on the KITTI 3D and BEV benchmarks.

Submitted 10 March, 2020; v1 submitted 16 September, 2019; originally announced September 2019.

Comments: Accepted in ICRA 2020

arXiv:1909.04838 [pdf, other]

3D traffic flow model for UAVs

Authors: Mirmojtaba Gharibi, Raouf Boutaba, Steven L. Waslander

Abstract: In this work, we introduce a microscopic traffic flow model called the Scalar Capacity Model (SCM), which can be used to study the formation of traffic on an airway link for autonomous Unmanned Aerial Vehicles (UAVs) as well as for ground vehicles on the road. Given the 3D nature of UAV flights, the main novelty in our model is to eliminate the commonly used notion of lanes and replace it with a notion of density and capacity of flow, but in such a way that individual vehicle motions can still be modeled. We name this a Density/Capacity View (DCV) of the link capacity and how vehicles utilize it, versus the traditional One/Multi-Lane View (OMV). An interesting feature of this model is that it exhibits both passing and blocking regimes (analogous to multi-lane or single-lane) depending on the set scalar parameter for capacity. We show the model has linear local (platoon) and string stability. Also, we perform numerical simulations and show evidence for non-linear stability. Our traffic flow model is represented by a nonlinear differential equation which we transform into a linear form. This makes our model analytically solvable in the blocking regime and piece-wise analytically solvable in the passing regime.

Submitted 10 September, 2019; originally announced September 2019.

Comments: 1 Table, 6 Figures

arXiv:1907.06777 [pdf, other]

Improving 3D Object Detection for Pedestrians with Virtual Multi-View Synthesis Orientation Estimation

Authors: Jason Ku, Alex D. Pon, Sean Walsh, Steven L. Waslander

Abstract: Accurately estimating the orientation of pedestrians is an important and challenging task for autonomous driving because this information is essential for tracking and predicting pedestrian behavior. This paper presents a flexible Virtual Multi-View Synthesis module that can be adopted into 3D object detection methods to improve orientation estimation. The module uses a multi-step process to acquire the fine-grained semantic information required for accurate orientation estimation. First, the scene's point cloud is densified using a structure preserving depth completion algorithm and each point is colorized using its corresponding RGB pixel. Next, virtual cameras are placed around each object in the densified point cloud to generate novel viewpoints, which preserve the object's appearance. We show that this module greatly improves the orientation estimation on the challenging pedestrian class on the KITTI benchmark. When used with the open-source 3D detector AVOD-FPN, we outperform all other published methods on the pedestrian Orientation, 3D, and Bird's Eye View benchmarks.

Submitted 15 July, 2019; originally announced July 2019.

Comments: Accepted in IROS 2019

arXiv:1905.08758 [pdf, other]

aUToTrack: A Lightweight Object Detection and Tracking System for the SAE AutoDrive Challenge

Authors: Keenan Burnett, Sepehr Samavi, Steven L. Waslander, Timothy D. Barfoot, Angela P. Schoellig

Abstract: The University of Toronto is one of eight teams competing in the SAE AutoDrive Challenge -- a competition to develop a self-driving car by 2020. After placing first at the Year 1 challenge, we are headed to MCity in June 2019 for the second challenge. There, we will interact with pedestrians, cyclists, and cars. For safe operation, it is critical to have an accurate estimate of the position of all objects surrounding the vehicle. The contributions of this work are twofold: First, we present a new object detection and tracking dataset (UofTPed50), which uses GPS to ground truth the position and velocity of a pedestrian. To our knowledge, a dataset of this type for pedestrians has not been shown in the literature before. Second, we present a lightweight object detection and tracking system (aUToTrack) that uses vision, LIDAR, and GPS/IMU positioning to achieve state-of-the-art performance on the KITTI Object Tracking benchmark. We show that aUToTrack accurately estimates the position and velocity of pedestrians, in real-time, using CPUs only. aUToTrack has been tested in closed-loop experiments on a real self-driving car, and we demonstrate its performance on our dataset.

Submitted 21 May, 2019; originally announced May 2019.

Comments: Accepted to CRV (Computer and Robot Vision) 2019

arXiv:1904.01690 [pdf, other]

Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction

Authors: Jason Ku, Alex D. Pon, Steven L. Waslander

Abstract: We present MonoPSR, a monocular 3D object detection method that leverages proposals and shape reconstruction. First, using the fundamental relations of a pinhole camera model, detections from a mature 2D object detector are used to generate a 3D proposal per object in a scene. The 3D locations of these proposals prove to be quite accurate, which greatly reduces the difficulty of regressing the final 3D bounding box detection. Simultaneously, a point cloud is predicted in an object centered coordinate system to learn local scale and shape information. However, the key challenge is how to exploit shape information to guide 3D localization. As such, we devise aggregate losses, including a novel projection alignment loss, to jointly optimize these tasks in the neural network to improve 3D localization accuracy. We validate our method on the KITTI benchmark where we set new state-of-the-art results among published monocular methods, including the harder pedestrian and cyclist classes, while maintaining efficient run-time.

Submitted 2 April, 2019; originally announced April 2019.

Comments: Accepted in CVPR 2019

arXiv:1903.03838 [pdf, other]

BayesOD: A Bayesian Approach for Uncertainty Estimation in Deep Object Detectors

Authors: Ali Harakeh, Michael Smart, Steven L. Waslander

Abstract: When incorporating deep neural networks into robotic systems, a major challenge is the lack of uncertainty measures associated with their output predictions. Methods for uncertainty estimation in the output of deep object detectors (DNNs) have been proposed in recent works, but have had limited success due to 1) information loss at the detector's non-maximum suppression (NMS) stage, and 2) failure to take into account the multitask, many-to-one nature of anchor-based object detection. To that end, we introduce BayesOD, an uncertainty estimation approach that reformulates the standard object detector inference and non-maximum suppression components from a Bayesian perspective. Experiments performed on four common object detection datasets show that BayesOD provides uncertainty estimates that are better correlated with the accuracy of detections, manifesting as a significant reduction of 9.77%-13.13% on the minimum Gaussian uncertainty error metric and a reduction of 1.63%-5.23% on the minimum Categorical uncertainty error metric. Code will be released at https://github.com/asharakeh/bayes-od-rc.

Submitted 16 September, 2019; v1 submitted 9 March, 2019; originally announced March 2019.

arXiv:1811.11946 [pdf, other]

doi 10.1109/CRV.2019.00024

Network Uncertainty Informed Semantic Feature Selection for Visual SLAM

Authors: Pranav Ganti, Steven L. Waslander

Abstract: In order to facilitate long-term localization using a visual simultaneous localization and mapping (SLAM) algorithm, careful feature selection can help ensure that reference points persist over long durations and the runtime and storage complexity of the algorithm remain consistent. We present SIVO (Semantically Informed Visual Odometry and Mapping), a novel information-theoretic feature selection method for visual SLAM which incorporates semantic segmentation and neural network uncertainty into the feature selection pipeline. Our algorithm selects points which provide the highest reduction in Shannon entropy between the entropy of the current state and the joint entropy of the state, given the addition of the new feature with the classification entropy of the feature from a Bayesian neural network. Each selected feature significantly reduces the uncertainty of the vehicle state and has been detected to be a static object (building, traffic sign, etc.) repeatedly with a high confidence. This selection strategy generates a sparse map which can facilitate long-term localization. The KITTI odometry dataset is used to evaluate our method, and we also compare our results against ORB_SLAM2. Overall, SIVO performs comparably to the baseline method while reducing the map size by almost 70%.

Submitted 26 August, 2019; v1 submitted 28 November, 2018; originally announced November 2018.

Comments: Published in: 2019 16th Conference on Computer and Robot Vision (CRV)

arXiv:1807.09532

doi 10.1016/j.isprsjprs.2018.11.011

Aerial Imagery for Roof Segmentation: A Large-Scale Dataset towards Automatic Mapping of Buildings

Authors: Qi Chen, Lei Wang, Yifan Wu, Guangming Wu, Zhiling Guo, Steven L. Waslander

Abstract: arXiv admin note: This version has been removed as the user did not have the right to agree to the license at the time of submission

Submitted 27 July, 2018; v1 submitted 25 July, 2018; originally announced July 2018.

Comments: arXiv admin note: This version has been removed as the user did not have the right to agree to the license at the time of submission

arXiv:1807.09304 [pdf, other]

Encoderless Gimbal Calibration of Dynamic Multi-Camera Clusters

Authors: Christopher L. Choi, Jason Rebello, Leonid Koppel, Pranav Ganti, Arun Das, Steven L. Waslander

Abstract: Dynamic Camera Clusters (DCCs) are multi-camera systems where one or more cameras are mounted on actuated mechanisms such as a gimbal. Existing methods for DCC calibration rely on joint angle measurements to resolve the time-varying transformation between the dynamic and static camera. This information is usually provided by motor encoders; however, joint angle measurements are not always readily available on off-the-shelf mechanisms. In this paper, we present an encoderless approach for DCC calibration which simultaneously estimates the kinematic parameters of the transformation chain as well as the unknown joint angles. We also demonstrate the integration of an encoderless gimbal mechanism with a state-of-the-art VIO algorithm, and show the extensions required in order to perform simultaneous online estimation of the joint angles and vehicle localization state. The proposed calibration approach is validated both in simulation and on a physical DCC composed of a 2-DOF gimbal mounted on a UAV. Finally, we show the experimental results of the calibrated mechanism integrated into the OKVIS VIO package, and demonstrate successful online joint angle estimation while maintaining localization accuracy that is comparable to a standard static multi-camera configuration.

Submitted 24 July, 2018; originally announced July 2018.

Comments: ICRA 2018

arXiv:1807.06072 [pdf, other]

Leveraging Pre-Trained 3D Object Detection Models For Fast Ground Truth Generation

Authors: Jungwook Lee, Sean Walsh, Ali Harakeh, Steven L. Waslander

Abstract: Training 3D object detectors for autonomous driving has been limited to small datasets due to the effort required to generate annotations. Reducing both task complexity and the amount of task switching done by annotators is key to reducing the effort and time required to generate 3D bounding box annotations. This paper introduces a novel ground truth generation method that combines human supervision with pretrained neural networks to generate per-instance 3D point cloud segmentation, 3D bounding boxes, and class annotations. The annotators provide object anchor clicks which behave as a seed to generate instance segmentation results in 3D. The points belonging to each instance are then used to regress object centroids, bounding box dimensions, and object orientation. Our proposed annotation scheme requires 30x lower human annotation time. We use the KITTI 3D object detection dataset to evaluate the efficiency and the quality of our annotation scheme. We also test the proposed scheme on previously unseen data from the Autonomoose self-driving vehicle to demonstrate generalization capabilities of the network.

Submitted 16 July, 2018; originally announced July 2018.

arXiv:1806.07987 [pdf, other]

A Hierarchical Deep Architecture and Mini-Batch Selection Method For Joint Traffic Sign and Light Detection

Authors: Alex D. Pon, Oles Andrienko, Ali Harakeh, Steven L. Waslander

Abstract: Traffic light and sign detectors on autonomous cars are integral for road scene perception. The literature is abundant with deep learning networks that detect either lights or signs, not both, which makes them unsuitable for real-life deployment due to the limited graphics processing unit (GPU) memory and power available on embedded systems. The root cause of this issue is that no public dataset contains both traffic light and sign labels, which leads to difficulties in developing a joint detection framework. We present a deep hierarchical architecture in conjunction with a mini-batch proposal selection mechanism that allows a network to detect both traffic lights and signs from training on separate traffic light and sign datasets. Our method solves the overlapping issue where instances from one dataset are not labelled in the other dataset. We are the first to present a network that performs joint detection on traffic lights and signs. We measure our network on the Tsinghua-Tencent 100K benchmark for traffic sign detection and the Bosch Small Traffic Lights benchmark for traffic light detection and show it outperforms the existing Bosch Small Traffic light state-of-the-art method. We focus on autonomous car deployment and show our network is more suitable than others because of its low memory footprint and real-time image processing time. Qualitative results can be viewed at https://youtu.be/_YmogPzBXOw

Submitted 13 September, 2018; v1 submitted 20 June, 2018; originally announced June 2018.

Comments: Accepted in the IEEE 15th Conference on Computer and Robot Vision

arXiv:1806.00526 [pdf, ps, other]

Multi-Step Prediction of Dynamic Systems with Recurrent Neural Networks

Authors: Nima Mohajerin, Steven L. Waslander

Abstract: Recurrent Neural Networks (RNNs) can encode rich dynamics which makes them suitable for modeling dynamic systems. To train an RNN for multi-step prediction of dynamic systems, it is crucial to efficiently address the state initialization problem, which seeks proper values for the RNN initial states at the beginning of each prediction interval. In this work, the state initialization problem is addressed using Neural Networks (NNs) to effectively train a variety of RNNs for modeling two aerial vehicles, a helicopter and a quadrotor, from experimental data. It is shown that the RNN initialized by the NN-based initialization method outperforms the state of the art. Further, a comprehensive study of RNNs trained for multi-step prediction of the two aerial vehicles is presented. The multi-step prediction of the quadrotor is enhanced using a hybrid model which combines a simplified physics-based motion model of the vehicle with RNNs. While the maximum translational and rotational velocities in the quadrotor dataset are about 4 m/s and 3.8 rad/s, respectively, the hybrid model produces predictions, over 1.9 seconds, which remain within 9 cm/s and 0.12 rad/s of the measured translational and rotational velocities, with 99% confidence on the test dataset.

Submitted 19 May, 2018; originally announced June 2018.

arXiv:1805.01810 [pdf, other]

Manifold Geometry with Fast Automatic Derivatives and Coordinate Frame Semantics Checking in C++

Authors: Leonid Koppel, Steven L. Waslander

Abstract: Computer vision and robotics problems often require representation and estimation of poses on the SE(3) manifold. Developers of algorithms that must run in real time face several time-consuming programming tasks, including deriving and computing analytic derivatives and avoiding mathematical errors when handling poses in multiple coordinate frames. To support rapid and error-free development, we present wave_geometry, a C++ manifold geometry library with two key contributions: expression template-based automatic differentiation and compile-time enforcement of coordinate frame semantics. We contrast the library with existing open source packages and show that it can evaluate Jacobians in forward and reverse mode with little to no runtime overhead compared to hand-coded derivatives. The library is available at https://github.com/wavelab/wave_geometry .

Submitted 4 May, 2018; originally announced May 2018.

Comments: 8 pages, Conference on Computer and Robot Vision (CRV) 2018

arXiv:1802.00036 [pdf, other]

In Defense of Classical Image Processing: Fast Depth Completion on the CPU

Authors: Jason Ku, Ali Harakeh, Steven L. Waslander

Abstract: With the rise of data driven deep neural networks as a realization of universal function approximators, most research on computer vision problems has moved away from hand crafted classical image processing algorithms. This paper shows that with a well designed algorithm, we are capable of outperforming neural network based methods on the task of depth completion. The proposed algorithm is simple and fast, runs on the CPU, and relies only on basic image processing operations to perform depth completion of sparse LIDAR depth data. We evaluate our algorithm on the challenging KITTI depth completion benchmark, and at the time of submission, our method ranks first on the KITTI test server among all published methods. Furthermore, our algorithm is data independent, requiring no training data to perform the task at hand. The code written in Python will be made publicly available at https://github.com/kujason/ip_basic.

Submitted 31 January, 2018; originally announced February 2018.
