
OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

Authors: Akshay Krishnan, Abhijit Kundu, Kevis-Kokitsi Maninis, James Hays, Matthew Brown

Abstract: We propose OmniNOCS, a large-scale monocular dataset with 3D Normalized Object Coordinate Space (NOCS) maps, object masks, and 3D bounding box annotations for indoor and outdoor scenes. OmniNOCS has 20 times more object classes and 200 times more instances than existing NOCS datasets (NOCS-Real275, Wild6D). We use OmniNOCS to train a novel, transformer-based monocular NOCS prediction model (NOCSformer) that can predict accurate NOCS, instance masks and poses from 2D object detections across diverse classes. It is the first NOCS model that can generalize to a broad range of classes when prompted with 2D boxes. We evaluate our model on the task of 3D oriented bounding box prediction, where it achieves comparable results to state-of-the-art 3D detection methods such as Cube R-CNN. Unlike other 3D detection methods, our model also provides detailed and accurate 3D object shape and segmentation. We propose a novel benchmark for the task of NOCS prediction based on OmniNOCS, which we hope will serve as a useful baseline for future work in this area. Our dataset and code will be at the project website: https://omninocs.github.io.

Submitted 11 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024, project website: https://omninocs.github.io

arXiv:2407.04952 [pdf, other]

Granular Privacy Control for Geolocation with Vision Language Models

Authors: Ethan Mendes, Yang Chen, James Hays, Sauvik Das, Wei Xu, Alan Ritter

Abstract: Vision Language Models (VLMs) are rapidly advancing in their capability to answer information-seeking questions. As these models are widely deployed in consumer applications, they could lead to new privacy risks due to emergent abilities to identify people in photos, geolocate images, etc. As we demonstrate, somewhat surprisingly, current open-source and proprietary VLMs are very capable image geolocators, making widespread geolocation with VLMs an immediate privacy risk, rather than merely a theoretical future concern. As a first step to address this challenge, we develop a new benchmark, GPTGeoChat, to test the ability of VLMs to moderate geolocation dialogues with users. We collect a set of 1,000 image geolocation conversations between in-house annotators and GPT-4v, which are annotated with the granularity of location information revealed at each turn. Using this new dataset, we evaluate the ability of various VLMs to moderate GPT-4v geolocation conversations by determining when too much location information has been revealed. We find that custom fine-tuned models perform on par with prompted API-based models when identifying leaked location information at the country or city level; however, fine-tuning on supervised data appears to be needed to accurately moderate finer granularities, such as the name of a restaurant or building.

Submitted 6 July, 2024; originally announced July 2024.

arXiv:2406.19390 [pdf, other]

SALVe: Semantic Alignment Verification for Floorplan Reconstruction from Sparse Panoramas

Authors: John Lambert, Yuguang Li, Ivaylo Boyadzhiev, Lambert Wixson, Manjunath Narayana, Will Hutchcroft, James Hays, Frank Dellaert, Sing Bing Kang

Abstract: We propose a new system for automatic 2D floorplan reconstruction that is enabled by SALVe, our novel pairwise learned alignment verifier. The inputs to our system are sparsely located 360° panoramas, whose semantic features (windows, doors, and openings) are inferred and used to hypothesize pairwise room adjacency or overlap. SALVe initializes a pose graph, which is subsequently optimized using GTSAM. Once the room poses are computed, room layouts are inferred using HorizonNet, and the floorplan is constructed by stitching the most confident layout boundaries. We validate our system qualitatively and quantitatively as well as through ablation studies, showing that it outperforms state-of-the-art SfM systems in completeness by over 200%, without sacrificing accuracy. Our results point to the significance of our work: poses of 81% of panoramas are localized in the first 2 connected components (CCs), and 89% in the first 3 CCs. Code and models are publicly available at https://github.com/zillow/salve.

Submitted 27 June, 2024; originally announced June 2024.

Comments: Accepted at ECCV 2022

arXiv:2406.10115 [pdf, other]

Shelf-Supervised Multi-Modal Pre-Training for 3D Object Detection

Authors: Mehar Khurana, Neehar Peri, Deva Ramanan, James Hays

Abstract: State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We do note, however, that such data is naturally collected in a multimodal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale image data. Specifically, we propose a shelf-supervised approach (i.e., supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training LiDAR-only and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and WOD, significantly improving over prior work in limited data settings.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2405.12978 [pdf, other]

Personalized Residuals for Concept-Driven Text-to-Image Generation

Authors: Cusuh Ham, Matthew Fisher, James Hays, Nicholas Kolkin, Yuchen Liu, Richard Zhang, Tobias Hinz

Abstract: We present personalized residuals and localized attention-guided sampling for efficient concept-driven generation using text-to-image diffusion models. Our method first represents concepts by freezing the weights of a pretrained text-conditioned diffusion model and learning low-rank residuals for a small subset of the model's layers. The residual-based approach then directly enables application of our proposed sampling technique, which applies the learned residuals only in areas where the concept is localized via cross-attention and applies the original diffusion weights in all other regions. Localized sampling therefore combines the learned identity of the concept with the existing generative prior of the underlying diffusion model. We show that personalized residuals effectively capture the identity of a concept in ~3 minutes on a single GPU without the use of regularization images and with fewer parameters than previous models, and localized sampling allows using the original model as a strong prior for large parts of the image.

Submitted 21 May, 2024; originally announced May 2024.

Comments: CVPR 2024. Project page at https://cusuh.github.io/personalized-residuals

arXiv:2403.04739 [pdf, other]

I Can't Believe It's Not Scene Flow!

Authors: Ishan Khatri, Kyle Vedder, Neehar Peri, Deva Ramanan, James Hays

Abstract: Current scene flow methods broadly fail to describe motion on small objects, and current scene flow evaluation protocols hide this failure by averaging over many points, with most drawn from larger objects. To fix this evaluation failure, we propose a new evaluation protocol, Bucket Normalized EPE, which is class-aware and speed-normalized, enabling contextualized error comparisons between object types that move at vastly different speeds. To highlight current method failures, we propose a frustratingly simple supervised scene flow baseline, TrackFlow, built by bolting a high-quality pretrained detector (trained using many class rebalancing techniques) onto a simple tracker, that produces state-of-the-art performance on current standard evaluations and large improvements over prior art on our new evaluation. Our results make it clear that all scene flow evaluations must be class and speed aware, and supervised scene flow methods must address point class imbalances. We release the evaluation code publicly at https://github.com/kylevedder/BucketedSceneFlowEval.

Submitted 7 March, 2024; originally announced March 2024.

Comments: 13 pages, 3 pages of citations, 2 pages of supplemental

arXiv:2310.12464 [pdf, other]

Lidar Panoptic Segmentation and Tracking without Bells and Whistles

Authors: Abhinav Agarwalla, Xuhua Huang, Jason Ziglar, Francesco Ferroni, Laura Leal-Taixé, James Hays, Aljoša Ošep, Deva Ramanan

Abstract: State-of-the-art lidar panoptic segmentation (LPS) methods follow a bottom-up, segmentation-centric fashion wherein they build upon semantic segmentation networks by utilizing clustering to obtain object instances. In this paper, we re-think this approach and propose a surprisingly simple yet effective detection-centric network for both LPS and tracking. Our network is modular by design and optimized for all aspects of both the panoptic segmentation and tracking tasks. One of the core components of our network is the object instance detection branch, which we train using point-level (modal) annotations, as available in segmentation-centric datasets. In the absence of amodal (cuboid) annotations, we regress modal centroids and object extent using trajectory-level supervision that provides information about object size, which cannot be inferred from single scans due to occlusions and the sparse nature of the lidar data. We obtain fine-grained instance segments by learning to associate lidar points with detected centroids. We evaluate our method on several 3D/4D LPS benchmarks and observe that our model establishes a new state-of-the-art among open-sourced models, outperforming recent query-based models.

Submitted 19 October, 2023; originally announced October 2023.

Comments: IROS 2023. Code at https://github.com/abhinavagarwalla/most-lps

arXiv:2310.03743 [pdf, other]

The Un-Kidnappable Robot: Acoustic Localization of Sneaking People

Authors: Mengyu Yang, Patrick Grady, Samarth Brahmbhatt, Arun Balajee Vasudevan, Charles C. Kemp, James Hays

Abstract: How easy is it to sneak up on a robot? We examine whether we can detect people using only the incidental sounds they produce as they move, even when they try to be quiet. We collect a robotic dataset of high-quality 4-channel audio paired with 360-degree RGB data of people moving in different indoor settings. We train models that predict if there is a moving person nearby and their location using only audio. We implement our method on a robot, allowing it to track a single person moving quietly with only passive audio sensing. For demonstration videos, see our project page: https://sites.google.com/view/unkidnappable-robot

Submitted 9 May, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: ICRA 2024 camera ready

arXiv:2309.04605 [pdf, other]

Evaluating Total Environmental Impact for a Computing Infrastructure

Authors: Adrian Jackson, Jon Hays, Alex Owen, Nicholas Walton, Alison Packer, Anish Mudaraddi

Abstract: In this paper we outline the results of a project to evaluate the total climate/carbon impact of a digital research infrastructure for a defined snapshot period. We outline the carbon model used to calculate the impact and the data collected to quantify that impact for a defined set of resources. We discuss the variation in potential impact across both the active and embodied carbon for computing hardware and produce a range of estimates on the amount of carbon equivalent climate impact for the snapshot period.

Submitted 8 September, 2023; originally announced September 2023.

arXiv:2309.01202 [pdf, other]

MAGMA: Music Aligned Generative Motion Autodecoder

Authors: Sohan Anisetty, Amit Raj, James Hays

Abstract: Mapping music to dance is a challenging problem that requires spatial and temporal coherence along with a continual synchronization with the music's progression. Taking inspiration from large language models, we introduce a 2-step approach for generating dance using a Vector Quantized-Variational Autoencoder (VQ-VAE) to distill motion into primitives and train a Transformer decoder to learn the correct sequencing of these primitives. We also evaluate the importance of music representations by comparing naive music feature extraction using Librosa to deep audio representations generated by state-of-the-art audio compression algorithms. Additionally, we train variations of the motion generator using relative and absolute positional encodings to determine the effect on generated motion quality when generating arbitrarily long sequence lengths. Our proposed approach achieves state-of-the-art results on music-to-motion generation benchmarks and enables the real-time generation of considerably longer motion sequences, the ability to chain multiple motion sequences seamlessly, and easy customization of motion sequences to meet style requirements.

Submitted 3 September, 2023; originally announced September 2023.

arXiv:2308.15268 [pdf, other]

Collision-Free Inverse Kinematics Through QP Optimization (iKinQP)

Authors: Julia Ashkanazy, Ariana Spalter, Joe Hays, Laura Hiatt, Roxana Leontie, C. Glen Henshaw

Abstract: Robotic manipulators are often designed with more actuated degrees-of-freedom than required to fully control an end effector's position and orientation. These "redundant" manipulators can allow infinite joint configurations that satisfy a particular task-space position and orientation, providing more possibilities for the manipulator to traverse a smooth collision-free trajectory. However, finding such a trajectory is non-trivial because the inverse kinematics for redundant manipulators cannot typically be solved analytically. Many strategies have been developed to tackle this problem, including the Jacobian pseudo-inverse method, rapidly-exploring random tree (RRT) motion planning, and quadratic programming (QP) based methods. Here, we present a flexible inverse kinematics-based QP strategy (iKinQP). Because it is independent of robot dynamics, the algorithm is relatively light-weight and able to run in real-time in step with torque control. Collisions are defined as kinematic trees of elementary geometries, making the algorithm agnostic to the method used to determine what collisions are in the environment. Collisions are treated as hard constraints, which guarantees the generation of collision-free trajectories. Trajectory smoothness is accomplished through the QP optimization. Our algorithm was evaluated for computational efficiency, smoothness, and its ability to provide trackable trajectories. It was shown that iKinQP is capable of providing smooth, collision-free trajectories at real-time rates.

Submitted 29 August, 2023; originally announced August 2023.

Comments: 9 pages, 8 figures, 2 tables

arXiv:2308.09105 [pdf, other]

Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

Authors: Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yu-Xiong Wang, Liang-Yan Gui

Abstract: Resource-constrained perception systems such as edge computing and vision-for-robotics require vision models to be both accurate and lightweight in computation and memory usage. While knowledge distillation is a proven strategy to enhance the performance of lightweight classification models, its application to structured outputs like object detection and instance segmentation remains a complicated task, due to the variability in outputs and complex internal network modules involved in the distillation process. In this paper, we propose a simple yet surprisingly effective sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student. To distill knowledge from a highly accurate but complex teacher model, we construct a sequence of teachers to help the student gradually adapt. Our progressive strategy can be easily combined with existing detection distillation mechanisms to consistently maximize student performance in various settings. To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students, and unprecedentedly boost the performance of ResNet-50 based RetinaNet from 36.5% to 42.0% AP and Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.

Submitted 17 August, 2023; originally announced August 2023.

Comments: ICML 2023

arXiv:2308.04054 [pdf, other]

An Empirical Analysis of Range for 3D Object Detection

Authors: Neehar Peri, Mengtian Li, Benjamin Wilson, Yu-Xiong Wang, James Hays, Deva Ramanan

Abstract: LiDAR-based 3D detection plays a vital role in autonomous navigation. Surprisingly, although autonomous vehicles (AVs) must detect both near-field objects (for collision avoidance) and far-field objects (for longer-term planning), contemporary benchmarks focus only on near-field 3D detection. However, AVs must detect far-field objects for safe navigation. In this paper, we present an empirical analysis of far-field 3D detection using the long-range detection dataset Argoverse 2.0 to better understand the problem, and share the following insight: near-field LiDAR measurements are dense and optimally encoded by small voxels, while far-field measurements are sparse and are better encoded with large voxels. We exploit this observation to build a collection of range experts tuned for near-vs-far field detection, and propose simple techniques to efficiently ensemble models for long-range detection that improve efficiency by 33% and boost accuracy by 3.2% CDS.

Submitted 8 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023 Workshop - Robustness and Reliability of Autonomous Vehicles in the Open-World

arXiv:2306.01906 [pdf, other]

Synaptic motor adaptation: A three-factor learning rule for adaptive robotic control in spiking neural networks

Authors: Samuel Schmidgall, Joe Hays

Abstract: Legged robots operating in real-world environments must possess the ability to rapidly adapt to unexpected conditions, such as changing terrains and varying payloads. This paper introduces the Synaptic Motor Adaptation (SMA) algorithm, a novel approach to achieving real-time online adaptation in quadruped robots through the utilization of neuroscience-derived rules of synaptic plasticity with three-factor learning. To facilitate rapid adaptation, we meta-optimize a three-factor learning rule via gradient descent to adapt to uncertainty by approximating an embedding produced by privileged information using only locally accessible onboard sensing data. Our algorithm performs similarly to state-of-the-art motor adaptation algorithms and presents a clear path toward achieving adaptive robotics with neuromorphic hardware.

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.10424 [pdf, other]

ZeroFlow: Scalable Scene Flow via Distillation

Authors: Kyle Vedder, Neehar Peri, Nathaniel Chodosh, Ishan Khatri, Eric Eaton, Dinesh Jayaraman, Yang Liu, Deva Ramanan, James Hays

Abstract: Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds. State-of-the-art methods use strong priors and test-time optimization techniques, but require on the order of tens of seconds to process full-size point clouds, making them unusable as computer vision primitives for real-time applications such as open world object detection. Feedforward methods are considerably faster, running on the order of tens to hundreds of milliseconds for full-size point clouds, but require expensive human supervision. To address both limitations, we propose Scene Flow via Distillation, a simple, scalable distillation framework that uses a label-free optimization method to produce pseudo-labels to supervise a feedforward model. Our instantiation of this framework, ZeroFlow, achieves state-of-the-art performance on the Argoverse 2 Self-Supervised Scene Flow Challenge while using zero human labels by simply training on large-scale, diverse unlabeled data. At test-time, ZeroFlow is over 1000x faster than label-free state-of-the-art optimization-based methods on full-size point clouds (34 FPS vs 0.028 FPS) and over 1000x cheaper to train on unlabeled data compared to the cost of human annotation ($394 vs ~$750,000). To facilitate further research, we release our code, trained model weights, and high quality pseudo-labels for the Argoverse 2 and Waymo Open datasets at https://vedder.io/zeroflow.html

Submitted 14 March, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: Accepted to ICLR 2024. 9 pages, 4 pages of citations, 6 pages of Supplemental. Project page with data releases is at http://vedder.io/zeroflow.html

arXiv:2304.03280 [pdf, other]

LANe: Lighting-Aware Neural Fields for Compositional Scene Synthesis

Authors: Akshay Krishnan, Amit Raj, Xianling Zhang, Alexandra Carlson, Nathan Tseng, Sandhya Sridhar, Nikita Jaipuria, James Hays

Abstract: Neural fields have recently enjoyed great success in representing and rendering 3D scenes. However, most state-of-the-art implicit representations model static or dynamic scenes as a whole, with minor variations. Existing work on learning disentangled world and object neural fields does not consider the problem of composing objects into different world neural fields in a lighting-aware manner. We present Lighting-Aware Neural Field (LANe) for the compositional synthesis of driving scenes in a physically consistent manner. Specifically, we learn a scene representation that disentangles the static background and transient elements into a world-NeRF and class-specific object-NeRFs to allow compositional synthesis of multiple objects in the scene. Furthermore, we explicitly designed both the world and object models to handle lighting variation, which allows us to compose objects into scenes with spatially varying lighting. This is achieved by constructing a light field of the scene and using it in conjunction with a learned shader to modulate the appearance of the object NeRFs. We demonstrate the performance of our model on a synthetic dataset of diverse lighting conditions rendered with the CARLA simulator, as well as a novel real-world dataset of cars collected at different times of the day. Our approach outperforms state-of-the-art compositional scene synthesis on the challenging dataset setup, via composing object-NeRFs learned from one scene into an entirely different scene whilst still respecting the lighting variations in the novel scene. For more results, please visit our project website https://lane-composition.github.io/.

Submitted 6 April, 2023; originally announced April 2023.

Comments: Project website: https://lane-composition.github.io

arXiv:2302.12764 [pdf, other]

Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

Authors: Cusuh Ham, James Hays, Jingwan Lu, Krishna Kumar Singh, Zhifei Zhang, Tobias Hinz

Abstract: We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion net, consists of only ~1% of the number of parameters of the base diffusion model, and is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs.

Submitted 18 May, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

Comments: SIGGRAPH Conference Proceedings 2023. Project page at https://mcm-diffusion.github.io

arXiv:2301.02310 [pdf, other]

PressureVision++: Estimating Fingertip Pressure from Diverse RGB Images

Authors: Patrick Grady, Jeremy A. Collins, Chengcheng Tang, Christopher D. Twigg, Kunal Aneja, James Hays, Charles C. Kemp

Abstract: Touch plays a fundamental role in manipulation for humans; however, machine perception of contact and pressure typically requires invasive sensors. Recent research has shown that deep models can estimate hand pressure based on a single RGB image. However, evaluations have been limited to controlled settings since collecting diverse data with ground-truth pressure measurements is difficult. We present a novel approach that enables diverse data to be captured with only an RGB camera and a cooperative participant. Our key insight is that people can be prompted to apply pressure in a certain way, and this prompt can serve as a weak label to supervise models to perform well under varied conditions. We collect a novel dataset with 51 participants making fingertip contact with diverse objects. Our network, PressureVision++, outperforms human annotators and prior work. We also demonstrate an application of PressureVision++ to mixed reality where pressure estimation allows everyday surfaces to be used as arbitrary touch-sensitive interfaces. Code, data, and models are available online.

Submitted 3 January, 2024; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: WACV 2024

arXiv:2301.00493 [pdf, other]

Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

Authors: Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, James Hays

Abstract: We introduce Argoverse 2 (AV2) - a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras, and two stereo cameras in addition to lidar point clouds, and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently-sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion for "scored actors" in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD Map with 3D lane and crosswalk geometry - sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.

Submitted 1 January, 2023; originally announced January 2023.

Comments: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks

arXiv:2212.07312 [pdf, other]

Trust, but Verify: Cross-Modality Fusion for HD Map Change Detection

Authors: John Lambert, James Hays

Abstract: High-definition (HD) map change detection is the task of determining when sensor data and map data are no longer in agreement with one another due to real-world changes. We collect the first dataset for the task, which we entitle the Trust, but Verify (TbV) dataset, by mining thousands of hours of data from over 9 months of autonomous vehicle fleet operations. We present learning-based formulations for solving the problem in the bird's eye view and ego-view. Because real map changes are infrequent and vector maps are easy to synthetically manipulate, we lean on simulated data to train our model. Perhaps surprisingly, we show that such models can generalize to real world distributions. The dataset, consisting of maps and logs collected in six North American cities, is one of the largest AV datasets to date with more than 7.8 million images. We make the data available to the public at https://www.argoverse.org/av2.html#mapchange-link, along with code and models at https://github.com/johnwlambert/tbv under the CC BY-NC-SA 4.0 license.

Submitted 14 December, 2022; originally announced December 2022.

Comments: NeurIPS 2021, Track on Datasets and Benchmarks. Project page: https://tbv-dataset.github.io/

arXiv:2211.13858 [pdf, other]

Far3Det: Towards Far-Field 3D Detection

Authors: Shubham Gupta, Jeet Kanjani, Mengtian Li, Francesco Ferroni, James Hays, Deva Ramanan, Shu Kong

Abstract: We focus on the task of far-field 3D detection (Far3Det) of objects beyond a certain distance from an observer, e.g., >50m. Far3Det is particularly important for autonomous vehicles (AVs) operating at highway speeds, which require detections of far-field obstacles to ensure sufficient braking distances. However, contemporary AV benchmarks such as nuScenes underemphasize this problem because they evaluate performance only up to a certain distance (50m). One reason is that obtaining far-field 3D annotations is difficult, particularly for lidar sensors that produce very few point returns for far-away objects. Indeed, we find that almost 50% of far-field objects (beyond 50m) contain zero lidar points. Secondly, current metrics for 3D detection employ a "one-size-fits-all" philosophy, using the same tolerance thresholds for near and far objects, inconsistent with tolerances for both human vision and stereo disparities. Both factors lead to an incomplete analysis of the Far3Det task. For example, while conventional wisdom tells us that high-resolution RGB sensors should be vital for 3D detection of far-away objects, lidar-based methods still rank higher compared to RGB counterparts on the current benchmark leaderboards. As a first step towards a Far3Det benchmark, we develop a method to find well-annotated scenes from the nuScenes dataset and derive a well-annotated far-field validation set. We also propose a Far3Det evaluation protocol and explore various 3D detection methods for Far3Det.
Our result convincingly justifies the long-held conventional wisdom that high-resolution RGB improves 3D detection in the far-field. We further propose a simple yet effective method that fuses detections from RGB and lidar detectors based on non-maximum suppression, which remarkably outperforms state-of-the-art 3D detectors in the far-field.

Submitted 24 November, 2022; originally announced November 2022.

Comments: WACV 2023. 12 pages, 8 figures, 10 tables

arXiv:2211.04625 [pdf, other]

Soft Augmentation for Image Classification

Authors: Yang Liu, Shen Yan, Laura Leal-Taixé, James Hays, Deva Ramanan

Abstract: Modern neural networks are over-parameterized and thus rely on strong regularization such as data augmentation and weight decay to reduce overfitting and improve generalization. The dominant form of data augmentation applies invariant transforms, where the learning target of a sample is invariant to the transform applied to that sample. We draw inspiration from human visual classification studies and propose generalizing augmentation with invariant transforms to soft augmentation where the learning target softens non-linearly as a function of the degree of the transform applied to the sample: e.g., more aggressive image crop augmentations produce less confident learning targets. We demonstrate that soft targets allow for more aggressive data augmentation, offer more robust performance boosts, work with other augmentation policies, and interestingly, produce better calibrated models (since they are trained to be less confident on aggressively cropped/occluded examples). Combined with existing aggressive augmentation strategies, soft target 1) doubles the top-1 accuracy boost across Cifar-10, Cifar-100, ImageNet-1K, and ImageNet-V2, 2) improves model occlusion performance by up to 4×, and 3) halves the expected calibration error (ECE). Finally, we show that soft augmentation generalizes to self-supervised classification tasks. Code available at https://github.com/youngleox/soft_augmentation

Submitted 23 January, 2024; v1 submitted 8 November, 2022; originally announced November 2022.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023 (pp. 16241-16250)

arXiv:2208.03354 [pdf, other]

A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch

Authors: Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays

Abstract: We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model for image retrieval using a text description and a sketch as input. We argue that both input modalities complement each other in a manner that cannot be achieved easily by either one alone. TASK-former follows the late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the queries. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.

Submitted 5 August, 2022; originally announced August 2022.

Comments: ECCV 2022

arXiv:2206.12520 [pdf, other]

Learning to learn online with neuromodulated synaptic plasticity in spiking neural networks

Authors: Samuel Schmidgall, Joe Hays

Abstract: We propose that in order to harness our understanding of neuroscience toward machine learning, we must first have powerful tools for training brain-like models of learning. Although substantial progress has been made toward understanding the dynamics of learning in the brain, neuroscience-derived models of learning have yet to demonstrate the same performance capabilities as methods in deep learning such as gradient descent. Inspired by the successes of machine learning using gradient descent, we demonstrate that models of neuromodulated synaptic plasticity from neuroscience can be trained in Spiking Neural Networks (SNNs) with a framework of learning to learn through gradient descent to address challenging online learning problems. This framework opens a new path toward developing neuroscience inspired online learning algorithms.

Submitted 27 June, 2022; v1 submitted 24 June, 2022; originally announced June 2022.

arXiv:2204.07268 [pdf, other]

Visual Pressure Estimation and Control for Soft Robotic Grippers

Authors: Patrick Grady, Jeremy A. Collins, Samarth Brahmbhatt, Christopher D. Twigg, Chengcheng Tang, James Hays, Charles C. Kemp

Abstract: Soft robotic grippers facilitate contact-rich manipulation, including robust grasping of varied objects. Yet the beneficial compliance of a soft gripper also results in significant deformation that can make precision manipulation challenging. We present visual pressure estimation & control (VPEC), a method that infers pressure applied by a soft gripper using an RGB image from an external camera. We provide results for visual pressure inference when a pneumatic gripper and a tendon-actuated gripper make contact with a flat surface. We also show that VPEC enables precision manipulation via closed-loop control of inferred pressure images. In our evaluation, a mobile manipulator (Stretch RE1 from Hello Robot) uses visual servoing to make contact at a desired pressure; follow a spatial pressure trajectory; and grasp small low-profile objects, including a microSD card, a penny, and a pill. Overall, our results show that visual estimates of applied pressure can enable a soft gripper to perform precision manipulation.

Submitted 9 August, 2022; v1 submitted 14 April, 2022; originally announced April 2022.

Comments: IROS 2022

arXiv:2203.15798 [pdf, other]

DRaCoN -- Differentiable Rasterization Conditioned Neural Radiance Fields for Articulated Avatars

Authors: Amit Raj, Umar Iqbal, Koki Nagano, Sameh Khamis, Pavlo Molchanov, James Hays, Jan Kautz

Abstract: Acquisition and creation of digital human avatars is an important problem with applications to virtual telepresence, gaming, and human modeling. Most contemporary approaches for avatar generation can be viewed either as 3D-based methods, which use multi-view data to learn a 3D representation with appearance (such as a mesh, implicit surface, or volume), or 2D-based methods which learn photo-realistic renderings of avatars but lack accurate 3D representations. In this work, we present DRaCoN, a framework for learning full-body volumetric avatars which exploits the advantages of both the 2D and 3D neural rendering techniques. It consists of a Differentiable Rasterization module, DiffRas, that synthesizes a low-resolution version of the target image along with additional latent features guided by a parametric body model. The output of DiffRas is then used as conditioning to our conditional neural 3D representation module (c-NeRF) which generates the final high-res image along with body geometry using volumetric rendering. While DiffRas helps in obtaining photo-realistic image quality, c-NeRF, which employs signed distance fields (SDF) for 3D representations, helps to obtain fine 3D geometric details. Experiments on the challenging ZJU-MoCap and Human3.6M datasets indicate that DRaCoN outperforms state-of-the-art methods both in terms of error metrics and visual quality.

Submitted 29 March, 2022; originally announced March 2022.

Comments: Project page at https://dracon-avatars.github.io/

arXiv:2203.10385 [pdf, other]

PressureVision: Estimating Hand Pressure from a Single RGB Image

Authors: Patrick Grady, Chengcheng Tang, Samarth Brahmbhatt, Christopher D. Twigg, Chengde Wan, James Hays, Charles C. Kemp

Abstract: People often interact with their surroundings by applying pressure with their hands. While hand pressure can be measured by placing pressure sensors between the hand and the environment, doing so can alter contact mechanics, interfere with human tactile perception, require costly sensors, and scale poorly to large environments. We explore the possibility of using a conventional RGB camera to infer hand pressure, enabling machine perception of hand pressure from uninstrumented hands and surfaces. The central insight is that the application of pressure by a hand results in informative appearance changes. Hands share biomechanical properties that result in similar observable phenomena, such as soft-tissue deformation, blood distribution, hand pose, and cast shadows. We collected videos of 36 participants with diverse skin tones applying pressure to an instrumented planar surface. We then trained a deep model (PressureVisionNet) to infer a pressure image from a single RGB image. Our model infers pressure for participants outside of the training data and outperforms baselines. We also show that the output of our model depends on the appearance of the hand and cast shadows near contact regions. Overall, our results suggest the appearance of a previously unobserved human hand can be used to accurately infer applied pressure. Data, code, and models are available online.

Submitted 30 September, 2022; v1 submitted 19 March, 2022; originally announced March 2022.

Comments: ECCV 2022 oral

arXiv:2203.09554 [pdf, other]

CoGS: Controllable Generation and Search from Sketch and Style

Authors: Cusuh Ham, Gemma Canet Tarres, Tu Bui, James Hays, Zhe Lin, John Collomosse

Abstract: We present CoGS, a novel method for the style-conditioned, sketch-driven synthesis of images. CoGS enables exploration of diverse appearance possibilities for a given sketched object, enabling decoupled control over the structure and the appearance of the output. Coarse-grained control over object structure and appearance is enabled via an input sketch and an exemplar "style" conditioning image to a transformer-based sketch and style encoder to generate a discrete codebook representation. We map the codebook representation into a metric space, enabling fine-grained control over selection and interpolation between multiple synthesis options before generating the image via a vector quantized GAN (VQGAN) decoder. Our framework thereby unifies search and synthesis tasks, in that a sketch and style pair may be used to run an initial synthesis which may be refined via combination with similar results in a search corpus to produce an image more closely matching the user's intent. We show that our model, trained on the 125 object classes of our newly created Pseudosketches dataset, is capable of producing a diverse gamut of semantic content and appearance styles.

Submitted 20 July, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

arXiv:2112.13762 [pdf, other]

MSeg: A Composite Dataset for Multi-domain Semantic Segmentation

Authors: John Lambert, Zhuang Liu, Ozan Sener, James Hays, Vladlen Koltun

Abstract: We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains. A naive merge of the constituent datasets yields poor performance due to inconsistent taxonomies and annotation practices. We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images, requiring more than 1.34 years of collective annotator effort. The resulting composite dataset enables training a single semantic segmentation model that functions effectively across domains and generalizes to datasets that were not seen during training. We adopt zero-shot cross-dataset transfer as a benchmark to systematically evaluate a model's robustness and show that MSeg training yields substantially more robust models in comparison to training on individual datasets or naive mixing of datasets without the presented contributions. A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training. We evaluate our models in the 2020 Robust Vision Challenge (RVC) as an extreme generalization experiment. MSeg training sets include only three of the seven datasets in the RVC; more importantly, the evaluation taxonomy of RVC is different and more detailed. Surprisingly, our model shows competitive performance and ranks second.
To evaluate how close we are to the grand aim of robust, efficient, and complete scene understanding, we go beyond semantic segmentation by training instance segmentation and panoptic segmentation models using our dataset. Moreover, we also evaluate various engineering design decisions and metrics, including resolution and computational efficiency. Although our models are far from this grand aim, our comprehensive evaluation is crucial for progress. We share all the models and code with the community.

Submitted 27 December, 2021; originally announced December 2021.

arXiv:2111.09930 [pdf, other]

Learning To Estimate Regions Of Attraction Of Autonomous Dynamical Systems Using Physics-Informed Neural Networks

Authors: Cody Scharzenberger, Joe Hays

Abstract: When learning to perform motor tasks in a simulated environment, neural networks must be allowed to explore their action space to discover new potentially viable solutions. However, in an online learning scenario with physical hardware, this exploration must be constrained by relevant safety considerations in order to avoid damage to the agent's hardware and environment. We aim to address this problem by training a neural network, which we will refer to as a "safety network", to estimate the region of attraction (ROA) of a controlled autonomous dynamical system. This safety network can thereby be used to quantify the relative safety of proposed control actions and prevent the selection of damaging actions. Here we present our development of the safety network by training an artificial neural network (ANN) to represent the ROA of several autonomous dynamical system benchmark problems. The training of this network is predicated upon both Lyapunov theory and neural solutions to partial differential equations (PDEs). By learning to approximate the viscosity solution to a specially chosen PDE that contains the dynamics of the system of interest, the safety network learns to approximate a particular function, similar to a Lyapunov function, whose zero level set is the boundary of the ROA.
We train our safety network to solve these PDEs in a semi-supervised manner following a modified version of the Physics Informed Neural Network (PINN) approach, utilizing a loss function that penalizes disagreement with the PDE's initial and boundary conditions, as well as non-zero residual and variational terms. In future work we intend to apply this technique to reinforcement learning agents during motor learning tasks.

Submitted 18 November, 2021; originally announced November 2021.

Comments: 31 pages, 17 figures

arXiv:2111.04113 [pdf, other]

Stable Lifelong Learning: Spiking neurons as a solution to instability in plastic neural networks

Authors: Samuel Schmidgall, Joe Hays

Abstract: Synaptic plasticity poses itself as a powerful method of self-regulated unsupervised learning in neural networks. A recent resurgence of interest has developed in utilizing Artificial Neural Networks (ANNs) together with synaptic plasticity for intra-lifetime learning. Plasticity has been shown to improve the learning capabilities of these networks in generalizing to novel environmental circumstances. However, the long-term stability of these trained networks has yet to be examined. This work demonstrates that utilizing plasticity together with ANNs leads to instability beyond the pre-specified lifespan used during training. This instability can lead to the dramatic decline of reward seeking behavior, or quickly lead to reaching environment terminal states. This behavior is shown to hold consistent for several plasticity rules on two different environments across many training time-horizons: a cart-pole balancing problem and a quadrupedal locomotion problem. We present a solution to this instability through the use of spiking neurons.

Submitted 7 November, 2021; originally announced November 2021.

arXiv:2109.08057 [pdf, other]

Evolutionary Self-Replication as a Mechanism for Producing Artificial Intelligence

Authors: Samuel Schmidgall, Joseph Hays

Abstract: Can reproduction alone in the context of survival produce intelligence in our machines? In this work, self-replication is explored as a mechanism for the emergence of intelligent behavior in modern learning environments. By focusing purely on survival, while undergoing natural selection, evolved organisms are shown to produce meaningful, complex, and intelligent behavior, demonstrating creative solutions to challenging problems without any notion of reward or objectives. Atari and robotic learning environments are re-defined in terms of natural selection, and the behavior which emerged in self-replicating organisms during these experiments is described in detail.

Submitted 23 September, 2022; v1 submitted 16 September, 2021; originally announced September 2021.

arXiv:2106.02681 [pdf, other]

doi 10.3389/fnbot.2021.629210

SpikePropamine: Differentiable Plasticity in Spiking Neural Networks

Authors: Samuel Schmidgall, Julia Ashkanazy, Wallace Lawson, Joe Hays

Abstract: The adaptive changes in synaptic efficacy that occur between spiking neurons have been demonstrated to play a critical role in learning for biological neural networks. Despite this source of inspiration, many learning focused applications using Spiking Neural Networks (SNNs) retain static synaptic connections, preventing additional learning after the initial training period. Here, we introduce a framework for simultaneously learning the underlying fixed-weights and the rules governing the dynamics of synaptic plasticity and neuromodulated synaptic plasticity in SNNs through gradient descent. We further demonstrate the capabilities of this framework on a series of challenging benchmarks, learning the parameters of several plasticity rules including BCM, Oja's, and their respective set of neuromodulatory variants. The experimental results display that SNNs augmented with differentiable plasticity are sufficient for solving a set of challenging temporal learning tasks that a traditional SNN fails to solve, even in the presence of significant noise. These networks are also shown to be capable of producing locomotion on a high-dimensional robotic learning task, where near-minimal degradation in performance is observed in the presence of novel conditions not seen during the initial training period.

Submitted 4 June, 2021; originally announced June 2021.

Journal ref: Frontiers in Neurorobotics, 22 September 2021
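
For readers unfamiliar with the plasticity rules named in the abstract above, the following is a minimal sketch of the classical rate-based Oja rule, dw = lr * y * (x - y * w). It is not the paper's spiking or neuromodulated formulation; the function names, learning rate, and input vector are illustrative assumptions.

```python
def oja_step(w, x, lr):
    """One update of the classical rate-based Oja rule (illustrative, not the paper's SNN variant)."""
    # Post-synaptic activity is the weighted sum of the inputs.
    y = sum(wi * xi for wi, xi in zip(w, x))
    # Hebbian growth (y * x) with a decay term (y^2 * w) that keeps the weights bounded.
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]


def train(w, x, lr=0.01, steps=2000):
    """Repeatedly apply the update for a single fixed input pattern x."""
    for _ in range(steps):
        w = oja_step(w, x, lr)
    return w


# Hypothetical input pattern and initial weights.
w_final = train([0.1, 0.1], [3.0, 4.0])
norm = sum(wi * wi for wi in w_final) ** 0.5
```

With a single fixed input, the weight vector in this sketch settles on the unit vector pointing along the input, which is the self-normalizing behavior that distinguishes Oja's rule from plain Hebbian learning.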

arXiv:2101.02697 [pdf, other]

PVA: Pixel-aligned Volumetric Avatars

Authors: Amit Raj, Michael Zollhoefer, Tomas Simon, Jason Saragih, Shunsuke Saito, James Hays, Stephen Lombardi

Abstract: Acquisition and rendering of photo-realistic human heads is a highly challenging research problem of particular importance for virtual telepresence. Currently, the highest quality is achieved by volumetric approaches trained in a person-specific manner on multi-view data. These models better represent fine structure, such as hair, compared to simpler mesh-based models. Volumetric models typically employ a global code to represent facial expressions, such that they can be driven by a small set of animation parameters. While such architectures achieve impressive rendering quality, they cannot easily be extended to the multi-identity setting. In this paper, we devise a novel approach for predicting volumetric avatars of the human head given just a small number of inputs. We enable generalization across identities by a novel parameterization that combines neural radiance fields with local, pixel-aligned features extracted directly from the inputs, thus sidestepping the need for very deep or complex networks. Our approach is trained in an end-to-end manner solely based on a photometric re-rendering loss without requiring explicit 3D supervision. We demonstrate that our approach outperforms the existing state of the art in terms of quality and is able to generate faithful facial expressions in a multi-identity setting.

Submitted 7 January, 2021; originally announced January 2021.

Comments: Project page located at https://volumetric-avatars.github.io/

arXiv:2012.12890 [pdf, other]

ANR: Articulated Neural Rendering for Virtual Avatars

Authors: Amit Raj, Julian Tanke, James Hays, Minh Vo, Carsten Stoll, Christoph Lassner

Abstract: The combination of traditional rendering with neural networks in Deferred Neural Rendering (DNR) provides a compelling balance between computational complexity and realism of the resulting images. Using skinned meshes for rendering articulating objects is a natural extension for the DNR framework and would open it up to a plethora of applications. However, in this case the neural shading step must account for deformations that are possibly not captured in the mesh, as well as alignment inaccuracies and dynamics -- which can confound the DNR pipeline. We present Articulated Neural Rendering (ANR), a novel framework based on DNR which explicitly addresses its limitations for virtual human avatars. We show the superiority of ANR not only with respect to DNR but also with methods specialized for avatar creation and animation. In two user studies, we observe a clear preference for our avatar model and we demonstrate state-of-the-art performance on quantitative evaluation metrics. Perceptually, we observe better temporal stability, level of detail and plausibility.

Submitted 23 December, 2020; originally announced December 2020.

arXiv:2011.00320 [pdf, other]

Scene Flow from Point Clouds with or without Learning

Authors: Jhony Kaesemodel Pontes, James Hays, Simon Lucey

Abstract: Scene flow is the three-dimensional (3D) motion field of a scene. It provides information about the spatial arrangement and rate of change of objects in dynamic environments. Current learning-based approaches seek to estimate the scene flow directly from point clouds and have achieved state-of-the-art performance. However, supervised learning methods are inherently domain specific and require a large amount of labeled data. Annotation of scene flow on real-world point clouds is expensive and challenging, and the lack of such datasets has recently sparked interest in self-supervised learning methods. How to accurately and robustly learn scene flow representations without labeled real-world data is still an open problem. Here we present a simple and interpretable objective function to recover the scene flow from point clouds. We use the graph Laplacian of a point cloud to regularize the scene flow to be "as-rigid-as-possible". Our proposed objective function can be used with or without learning: as a self-supervisory signal to learn scene flow representations, or as a non-learning-based method in which the scene flow is optimized during runtime. Our approach outperforms related works in many datasets. We also show the immediate use of our proposed method in two applications: motion segmentation and point cloud densification.

Submitted 31 October, 2020; originally announced November 2020.

Comments: International Conference on 3D Vision (3DV 2020)
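
The graph-Laplacian regularizer mentioned in the abstract above can be illustrated with a small sketch. It builds an unweighted k-nearest-neighbour graph over a toy point cloud and evaluates the smoothness energy, the sum over edges of ||f_i - f_j||^2, which equals trace(F^T L F) for the Laplacian L = D - A. The graph construction, the unit edge weights, and all names are assumptions made for illustration; the paper's exact objective is not given in the abstract.

```python
def knn_edges(points, k):
    """Undirected k-nearest-neighbour edges (i, j) with i < j (illustrative construction)."""
    edges = set()
    for i, p in enumerate(points):
        # Squared distances from point i to every other point, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def laplacian_energy(flow, edges):
    """Sum over graph edges of the squared difference between neighbouring flow vectors."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(flow[i], flow[j]))
        for i, j in edges
    )


# Toy point cloud: four corners of a unit square (hypothetical data).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
edges = knn_edges(points, k=2)

# Every point moves by the same translation, so neighbouring flows agree.
translation_flow = [(0.5, 0.0, 0.0)] * 4
# One point moves against its neighbours, which the regularizer penalizes.
inconsistent_flow = [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (-0.5, 0.0, 0.0)]

energy_translation = laplacian_energy(translation_flow, edges)
energy_inconsistent = laplacian_energy(inconsistent_flow, edges)
```

In this sketch a pure translation has zero energy, while a flow in which one point disagrees with its neighbours is penalized. Minimizing such a term alongside a data term pushes neighbouring points toward moving together.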

arXiv:2008.10592 [pdf, other]

3D for Free: Crossmodal Transfer Learning using HD Maps

Authors: Benjamin Wilson, Zsolt Kira, James Hays

Abstract: 3D object detection is a core perceptual challenge for robotics and autonomous driving. However, the class-taxonomies in modern autonomous driving datasets are significantly smaller than many influential 2D detection datasets. In this work, we address the long-tail problem by leveraging both the large class-taxonomies of modern 2D datasets and the robustness of state-of-the-art 2D detection methods. We proceed to mine a large, unlabeled dataset of images and LiDAR, and estimate 3D object bounding cuboids, seeded from an off-the-shelf 2D instance segmentation model. Critically, we constrain this ill-posed 2D-to-3D mapping by using high-definition maps and object size priors. The result of the mining process is 3D cuboids with varying confidence. This mining process is itself a 3D object detector, although not especially accurate when evaluated as such. However, we then train a 3D object detection model on these cuboids and, consistent with other recent observations in the deep learning literature, find that the resulting model is fairly robust to the noisy supervision that our mining process provides. We mine a collection of 1151 unlabeled, multimodal driving logs from an autonomous vehicle and use the discovered objects to train a LiDAR-based object detector. We show that detector performance increases as we mine more unlabeled data. With our full, unlabeled dataset, our method performs competitively with fully supervised methods, even exceeding the performance for certain object categories, without any human 3D annotations.

Submitted 24 August, 2020; originally announced August 2020.

arXiv:2008.08115 [pdf, other]

TIDE: A General Toolbox for Identifying Object Detection Errors

Authors: Daniel Bolya, Sean Foley, James Hays, Judy Hoffman

Abstract: We introduce TIDE, a framework and associated toolbox for analyzing the sources of error in object detection and instance segmentation algorithms. Importantly, our framework is applicable across datasets and can be applied directly to output prediction files without required knowledge of the underlying prediction system. Thus, our framework can be used as a drop-in replacement for the standard mAP computation while providing a comprehensive analysis of each model's strengths and weaknesses. We segment errors into six types and, crucially, are the first to introduce a technique for measuring the contribution of each error in a way that isolates its effect on overall performance. We show that such a representation is critical for drawing accurate, comprehensive conclusions through in-depth analysis across 4 datasets and 7 recognition models. Available at https://dbolya.github.io/tide/

Submitted 31 August, 2020; v1 submitted 18 August, 2020; originally announced August 2020.

Comments: Updated LVIS results with the v1.0.1 error calculation

arXiv:2007.09545 [pdf, other]

ContactPose: A Dataset of Grasps with Object Contact and Hand Pose

Authors: Samarth Brahmbhatt, Chengcheng Tang, Christopher D. Twigg, Charles C. Kemp, James Hays

Abstract: Grasping is natural for humans. However, it involves complex hand configurations and soft tissue deformation that can result in complicated regions of contact between the hand and the object. Understanding and modeling this contact can potentially improve hand models, AR/VR experiences, and robotic grasping. Yet, we currently lack datasets of hand-object contact paired with other data modalities, which is crucial for developing and evaluating contact modeling techniques. We introduce ContactPose, the first dataset of hand-object contact paired with hand pose, object pose, and RGB-D images. ContactPose has 2306 unique grasps of 25 household objects grasped with 2 functional intents by 50 participants, and more than 2.9 M RGB-D grasp images. Analysis of ContactPose data reveals interesting relationships between hand pose and contact. We use this data to rigorously evaluate various data representations, heuristics from the literature, and learning methods for contact modeling. Data, code, and trained models are available at https://contactpose.cc.gatech.edu.

Submitted 18 July, 2020; originally announced July 2020.

Comments: The European Conference on Computer Vision (ECCV) 2020

arXiv:1911.02620 [pdf, other]

Argoverse: 3D Tracking and Forecasting with Rich Maps

Authors: Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, James Hays

Abstract: We present Argoverse -- two datasets designed to support autonomous vehicle machine learning tasks such as 3D tracking and motion forecasting. Argoverse was collected by a fleet of autonomous vehicles in Pittsburgh and Miami. The Argoverse 3D Tracking dataset includes 360 degree images from 7 cameras with overlapping fields of view, 3D point clouds from long range LiDAR, 6-DOF pose, and 3D track annotations. Notably, it is the only modern AV dataset that provides forward-facing stereo imagery. The Argoverse Motion Forecasting dataset includes more than 300,000 5-second tracked scenarios with a particular vehicle identified for trajectory forecasting. Argoverse is the first autonomous vehicle dataset to include "HD maps" with 290 km of mapped lanes with geometric and semantic metadata. All data is released under a Creative Commons license at www.argoverse.org. In our baseline experiments, we illustrate how detailed map information such as lane direction, driveable area, and ground height improves the accuracy of 3D object tracking and motion forecasting. Our tracking and forecasting experiments represent only an initial exploration of the use of rich maps in robotic perception. We hope that Argoverse will enable the research community to explore these problems in greater depth.

Submitted 6 November, 2019; originally announced November 2019.

Comments: CVPR 2019

arXiv:1907.07388 [pdf, other]

Towards Markerless Grasp Capture

Authors: Samarth Brahmbhatt, Charles C. Kemp, James Hays

Abstract: Humans excel at grasping objects and manipulating them. Capturing human grasps is important for understanding grasping behavior and reconstructing it realistically in Virtual Reality (VR). However, grasp capture - capturing the pose of a hand grasping an object, and orienting it w.r.t. the object - is difficult because of the complexity and diversity of the human hand, and occlusion. Reflective markers and magnetic trackers traditionally used to mitigate this difficulty introduce undesirable artifacts in images and can interfere with natural grasping behavior. We present preliminary work on a completely marker-less algorithm for grasp capture from a video depicting a grasp. We show how recent advances in 2D hand pose estimation can be used with well-established optimization techniques. Uniquely, our algorithm can also capture hand-object contact in detail and integrate it in the grasp capture process. This is work in progress; find more details at https://contactdb.cc.gatech.edu/grasp_capture.html.

Submitted 17 July, 2019; originally announced July 2019.

Comments: Third Workshop on Computer Vision for AR/VR, CVPR 2019

arXiv:1905.05882 [pdf, other]

Kernel Mean Matching for Content Addressability of GANs

Authors: Wittawat Jitkrittum, Patsorn Sangkloy, Muhammad Waleed Gondal, Amit Raj, James Hays, Bernhard Schölkopf

Abstract: We propose a novel procedure which adds "content-addressability" to any given unconditional implicit model, e.g., a generative adversarial network (GAN). The procedure allows users to control the generative process by specifying a set (arbitrary size) of desired examples based on which similar samples are generated from the model. The proposed approach, based on kernel mean matching, is applicable to any generative model that transforms latent vectors to samples, and does not require retraining of the model. Experiments on various high-dimensional image generation problems (CelebA-HQ, LSUN bedroom, bridge, tower) show that our approach is able to generate images which are consistent with the input set, while retaining the image quality of the original model. To our knowledge, this is the first work that attempts to construct, at test time, a content-addressable generative model from a trained marginal model.

Submitted 14 May, 2019; originally announced May 2019.

Comments: Wittawat Jitkrittum and Patsorn Sangkloy contributed equally to this work
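
As background for the kernel mean matching idea named in the abstract above, one standard way to measure how closely the kernel mean of one sample set matches that of another is the maximum mean discrepancy (MMD). The sketch below computes a biased squared-MMD estimate with a Gaussian kernel. The kernel choice, the bandwidth, and all names and data are assumptions for illustration. The sketch only scores a candidate set against the examples; it does not reproduce the paper's procedure for choosing latent vectors.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared distance between the kernel means of xs and ys."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


# Hypothetical feature vectors: user-specified examples and two candidate sample sets.
examples = [(0.0, 0.0), (0.2, -0.1)]
close_samples = [(0.1, 0.0), (0.1, -0.1)]
distant_samples = [(3.0, 3.0), (3.2, 2.9)]

score_close = mmd_squared(examples, close_samples)
score_distant = mmd_squared(examples, distant_samples)
```

A lower score means the candidate set's kernel mean lies closer to that of the examples. A content-addressable generator in this spirit would search for latent vectors whose generated samples drive such a discrepancy down.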

arXiv:1904.06830 [pdf, other]

ContactDB: Analyzing and Predicting Grasp Contact via Thermal Imaging

Authors: Samarth Brahmbhatt, Cusuh Ham, Charles C. Kemp, James Hays

Abstract: Grasping and manipulating objects is an important human skill. Since hand-object contact is fundamental to grasping, capturing it can lead to important insights. However, observing contact through external sensors is challenging because of occlusion and the complexity of the human hand. We present ContactDB, a novel dataset of contact maps for household objects that captures the rich hand-object contact that occurs during grasping, enabled by use of a thermal camera. Participants in our study grasped 3D printed objects with a post-grasp functional intent. ContactDB includes 3750 3D meshes of 50 household objects textured with contact maps and 375K frames of synchronized RGB-D+thermal images. To the best of our knowledge, this is the first large-scale dataset that records detailed contact maps for human grasps. Analysis of this data shows the influence of functional intent and object size on grasping, the tendency to touch/avoid 'active areas', and the high frequency of palm and proximal finger contact. Finally, we train state-of-the-art image translation and 3D convolution algorithms to predict diverse contact patterns from object shape. Data, code and models are available at https://contactdb.cc.gatech.edu.

Submitted 14 April, 2019; originally announced April 2019.

Comments: CVPR 2019 Oral

arXiv:1904.03754 [pdf, other]

ContactGrasp: Functional Multi-finger Grasp Synthesis from Contact

Authors: Samarth Brahmbhatt, Ankur Handa, James Hays, Dieter Fox

Abstract: Grasping and manipulating objects is an important human skill. Since most objects are designed to be manipulated by human hands, anthropomorphic hands can enable richer human-robot interaction. Desirable grasps are not only stable, but also functional: they enable post-grasp actions with the object. However, functional grasp synthesis for high degree-of-freedom anthropomorphic hands from object shape alone is challenging because of the large optimization space. We present ContactGrasp, a framework for functional grasp synthesis from object shape and contact on the object surface. Contact can be manually specified or obtained through demonstrations. Our contact representation is object-centric and allows functional grasp synthesis even for hand models different than the one used for demonstration. Using a dataset of contact demonstrations from humans grasping diverse household objects, we synthesize functional grasps for three hand models and two functional intents. The project webpage is https://contactdb.cc.gatech.edu/contactgrasp.html.

Submitted 25 July, 2019; v1 submitted 7 April, 2019; originally announced April 2019.

Comments: IROS 2019 camera ready version

arXiv:1903.00793 [pdf, other]

Let's Transfer Transformations of Shared Semantic Representations

Authors: Nam Vo, Lu Jiang, James Hays

Abstract: With a good image understanding capability, can we manipulate an image's high-level semantic representation? Such a transformation operation can be used to generate or retrieve similar images but with a desired modification (for example, changing a beach background to a street background); similar ability has been demonstrated in zero-shot learning, attribute composition and attribute manipulation image search. In this work we show how one can learn transformations with no training examples by learning them on another domain and then transferring them to the target domain. This is feasible if: first, transformation training data is more accessible in the other domain and second, both domains share similar semantics such that one can learn transformations in a shared embedding space. We demonstrate this on an image retrieval task where the search query is an image plus an additional transformation specification (for example: search for images similar to this one but where the background is a street instead of a beach). In one experiment, we transfer transformations from synthesized 2D blob images to 3D rendered images, and in the other, we transfer from the text domain to the natural image domain.

Submitted 2 March, 2019; originally announced March 2019.

arXiv:1812.07119 [pdf, other]

Composing Text and Image for Image Retrieval - An Empirical Odyssey

Authors: Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays

Abstract: In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image. For example, we may present an image of the Eiffel tower, and ask the system to find images which are visually similar but are modified in small ways, such as being taken at nighttime instead of during the day. To tackle this task, we learn a similarity metric between a target image and a source image plus source text, an embedding and composing function such that the target image feature is close to the source image plus text composition feature. We propose a new way to combine image and text using such a function that is designed for the retrieval task. We show this outperforms existing approaches on 3 different datasets, namely Fashion-200k, MIT-States and a new synthetic dataset we create based on CLEVR. We also show that our approach can be used to classify input queries, in addition to image retrieval.

Submitted 17 December, 2018; originally announced December 2018.

arXiv:1810.11630 [pdf, other]

Informative Features for Model Comparison

Authors: Wittawat Jitkrittum, Heishiro Kanagawa, Patsorn Sangkloy, James Hays, Bernhard Schölkopf, Arthur Gretton

Abstract: Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.

Submitted 27 October, 2018; originally announced October 2018.

Comments: Accepted to NIPS 2018

MSC Class: 46E22; 62G10 ACM Class: G.3; I.2.6

arXiv:1803.03310 [pdf, other]

Generalization in Metric Learning: Should the Embedding Layer be the Embedding Layer?

Authors: Nam Vo, James Hays

Abstract: This work studies deep metric learning under small to medium scale data as we believe that better generalization could be a contributing factor to the improvement of previous fine-grained image retrieval methods; it should be considered when designing future techniques. In particular, we investigate using other layers in a deep metric learning system (besides the embedding layer) for feature extraction and analyze how well they perform on training data and generalize to testing data. From this study, we suggest a new regularization practice where one can add or choose a more optimal layer for feature extraction. State-of-the-art performance is demonstrated on 3 fine-grained image retrieval benchmarks: Cars-196, CUB-200-2011, and Stanford Online Product.

Submitted 10 December, 2018; v1 submitted 8 March, 2018; originally announced March 2018.

Comments: new version for WACV

arXiv:1801.07388 [pdf, other]

Let's Dance: Learning From Online Dance Videos

Authors: Daniel Castro, Steven Hickson, Patsorn Sangkloy, Bhavishya Mittal, Sean Dai, James Hays, Irfan Essa

Abstract: In recent years, deep neural network approaches have naturally extended to the video domain, in their simplest case by aggregating per-frame classifications as a baseline for action recognition. A majority of the work in this area extends from the imaging domain, leading to visual-feature heavy approaches on temporal data. To address this issue we introduce "Let's Dance", a 1000 video dataset (and growing) comprised of 10 visually overlapping dance categories that require motion for their classification. We stress the importance of human motion as a key distinguisher in our work given that, as we show in this work, visual information is not sufficient to classify motion-heavy categories. We compare our dataset's performance using imaging techniques with UCF-101 and demonstrate this inherent difficulty. We present a comparison of numerous state-of-the-art techniques on our dataset using three different representations (video, optical flow and multi-person pose data) in order to analyze these approaches. We discuss the motion parameterization of each of them and their value in learning to categorize online dance videos. Lastly, we release this dataset (and its three representations) for the research community to use.

Submitted 22 January, 2018; originally announced January 2018.

Comments: first submitted November 2016

ACM Class: I.4; I.5; I.5.1

arXiv:1801.02753 [pdf, other]

SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis

Authors: Wengling Chen, James Hays

Abstract: Synthesizing realistic images from human-drawn sketches is a challenging problem in computer graphics and vision. Existing approaches either need exact edge maps, or rely on retrieval of existing photographs. In this work, we propose a novel Generative Adversarial Network (GAN) approach that synthesizes plausible images from 50 categories including motorcycles, horses and couches. We demonstrate a data augmentation technique for sketches which is fully automatic, and we show that the augmented data is helpful to our task. We introduce a new network building block suitable for both the generator and discriminator which improves the information flow by injecting the input image at multiple scales. Compared to state-of-the-art image translation methods, our approach generates more realistic images and achieves significantly higher Inception Scores.

Submitted 12 April, 2018; v1 submitted 8 January, 2018; originally announced January 2018.

Comments: Accepted to CVPR 2018
