Showing 1–25 of 25 results for author: Harley, A W

  1. arXiv:2405.19678

    cs.CV cs.AI

    View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

    Authors: Haodi He, Colton Stearns, Adam W. Harley, Leonidas J. Guibas

    Abstract: Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of "coarse" or "fine" granularity. In this work, we ad…

    Submitted 17 July, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  2. arXiv:2401.02416

    cs.CV cs.AI cs.LG cs.RO

    ODIN: A Single Model for 2D and 3D Segmentation

    Authors: Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W. Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, Katerina Fragkiadaki

    Abstract: State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that…

    Submitted 25 June, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

    Comments: Camera Ready (CVPR 2024, Highlight)

  3. arXiv:2401.00850

    cs.CV cs.AI

    Refining Pre-Trained Motion Models

    Authors: Xinglong Sun, Adam W. Harley, Leonidas J. Guibas

    Abstract: Given the difficulty of manually annotating motion in video, the current best motion estimation methods are trained with synthetic data, and therefore struggle somewhat due to a train/test gap. Self-supervised methods hold the promise of training directly on real video, but typically perform worse. These include methods trained with warp error (i.e., color constancy) combined with smoothness terms…

    Submitted 16 February, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

    Comments: Accepted at ICRA 2024

  4. arXiv:2312.15130

    cs.CV

    PACE: A Large-Scale Dataset with Pose Annotations in Cluttered Environments

    Authors: Yang You, Kai Xiong, Zhening Yang, Zhengxiang Huang, Junwei Zhou, Ruoxi Shi, Zhou Fang, Adam W. Harley, Leonidas Guibas, Cewu Lu

    Abstract: Pose estimation is a crucial task in computer vision and robotics, enabling the tracking and manipulation of objects in images or videos. While several datasets exist for pose estimation, there is a lack of large-scale datasets specifically focusing on cluttered scenes with occlusions. We introduce PACE (Pose Annotations in Cluttered Environments), a large-scale benchmark designed to advance the d…

    Submitted 31 March, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

  5. arXiv:2310.06992

    cs.CV

    Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

    Authors: Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas Guibas, Katerina Fragkiadaki

    Abstract: Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This begs the question: can we re-purpose these large-scale pre-trained…

    Submitted 25 January, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: Project page available at https://wenhsuanchu.github.io/ovtracktor/

  6. arXiv:2309.03468

    cs.CV cs.AI cs.LG

    Cross-Image Context Matters for Bongard Problems

    Authors: Nikhil Raghuraman, Adam W. Harley, Leonidas Guibas

    Abstract: Current machine learning methods struggle to solve Bongard problems, which are a type of IQ test that requires deriving an abstract "concept" from a set of positive and negative "support" images, and then classifying whether or not a new query image depicts the key concept. On Bongard-HOI, a benchmark for natural-image Bongard problems, existing methods have only reached 66% accuracy (where chance…

    Submitted 6 September, 2023; originally announced September 2023.

    Comments: Main paper: 7 pages, Appendix: 10 pages, 30 figures. Code: https://github.com/nraghuraman/bongard-context

  7. arXiv:2307.15055

    cs.CV

    PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

    Authors: Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, Leonidas J. Guibas

    Abstract: We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework, for the training and evaluation of long-term fine-grained tracking algorithms. Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. Toward the goal of naturalism, we animate deformable characters using real-world motion capture data, we build 3D scenes to m…

    Submitted 27 July, 2023; originally announced July 2023.

  8. arXiv:2207.10761

    cs.CV

    TIDEE: Tidying Up Novel Rooms using Visuo-Semantic Commonsense Priors

    Authors: Gabriel Sarch, Zhaoyuan Fang, Adam W. Harley, Paul Schydlo, Michael J. Tarr, Saurabh Gupta, Katerina Fragkiadaki

    Abstract: We introduce TIDEE, an embodied agent that tidies up a disordered scene based on learned commonsense object placement and room arrangement priors. TIDEE explores a home environment, detects objects that are out of their natural place, infers plausible object contexts for them, localizes such contexts in the current scene, and repositions the objects. Commonsense priors are encoded in three modules…

    Submitted 21 July, 2022; originally announced July 2022.

  9. arXiv:2206.07959

    cs.CV

    Simple-BEV: What Really Matters for Multi-Sensor BEV Perception?

    Authors: Adam W. Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, Katerina Fragkiadaki

    Abstract: Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem because of the expense of LiDAR systems compared to cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from the multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV)…

    Submitted 29 September, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

  10. arXiv:2204.04153

    cs.CV

    Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories

    Authors: Adam W. Harley, Zhaoyuan Fang, Katerina Fragkiadaki

    Abstract: Tracking pixels in videos is typically studied as an optical flow estimation problem, where every pixel is described with a displacement vector that locates it in the next frame. Even though wider temporal context is freely available, prior efforts to take this into account have yielded only small gains over 2-frame methods. In this paper, we revisit Sand and Teller's "particle video" approach, an…

    Submitted 25 July, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

  11. arXiv:2104.03851

    cs.CV

    CoCoNets: Continuous Contrastive 3D Scene Representations

    Authors: Shamit Lal, Mihir Prabhudesai, Ishita Mediratta, Adam W. Harley, Katerina Fragkiadaki

    Abstract: This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos, agnostic to object and scene semantic content, and evaluates the resulting scene representations in the downstream tasks of visual correspondence, object tracking, and object detection. The model infers a latent 3D representation of the scene in the form of 3D feature points…

    Submitted 8 April, 2021; originally announced April 2021.

  12. arXiv:2104.03424

    cs.CV

    Track, Check, Repeat: An EM Approach to Unsupervised Tracking

    Authors: Adam W. Harley, Yiming Zuo, Jing Wen, Ayush Mangal, Shubhankar Potdar, Ritwick Chaudhry, Katerina Fragkiadaki

    Abstract: We propose an unsupervised method for detecting and tracking moving objects in 3D, in unlabelled RGB-D videos. The method begins with classic handcrafted techniques for segmenting objects using motion cues: we estimate optical flow and camera motion, and conservatively segment regions that appear to be moving independently of the background. Treating these initial segments as pseudo-labels, we lea…

    Submitted 7 April, 2021; originally announced April 2021.

  13. arXiv:2012.00057

    cs.CV cs.AI cs.LG

    Move to See Better: Self-Improving Embodied Object Detection

    Authors: Zhaoyuan Fang, Ayush Jain, Gabriel Sarch, Adam W. Harley, Katerina Fragkiadaki

    Abstract: Passive methods for object detection and segmentation treat images of the same scene as individual samples and do not exploit object permanence across multiple views. Generalization to novel or difficult viewpoints thus requires additional training with lots of annotations. In contrast, humans often recognize objects by simply moving around, to get more informative viewpoints. In this paper, we pr…

    Submitted 29 March, 2021; v1 submitted 30 November, 2020; originally announced December 2020.

    Comments: First three authors contributed equally. Project Page: https://ayushjain1144.github.io/SeeingByMoving/

  14. arXiv:2011.03367

    cs.CV

    Disentangling 3D Prototypical Networks For Few-Shot Concept Learning

    Authors: Mihir Prabhudesai, Shamit Lal, Darshan Patil, Hsiao-Yu Tung, Adam W. Harley, Katerina Fragkiadaki

    Abstract: We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification. Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay. They are trained end-to…

    Submitted 20 July, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

  15. arXiv:2010.16279

    cs.CV

    3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations

    Authors: Mihir Prabhudesai, Shamit Lal, Hsiao-Yu Fish Tung, Adam W. Harley, Shubhankar Potdar, Katerina Fragkiadaki

    Abstract: We propose a system that learns to detect objects and infer their 3D poses in RGB-D images. Many existing systems can identify objects and infer 3D poses, but they heavily rely on human labels and 3D annotations. The challenge here is to achieve this without relying on strong supervision signals. To address this challenge, we propose a model that maps RGB-D images to a set of 3D visual feature map…

    Submitted 30 October, 2020; originally announced October 2020.

  16. arXiv:2008.01295

    cs.CV

    Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping

    Authors: Adam W. Harley, Shrinidhi K. Lakshmikanth, Paul Schydlo, Katerina Fragkiadaki

    Abstract: We hypothesize that an agent that can look around in static scenes can learn rich visual representations applicable to 3D object tracking in complex dynamic scenes. We are motivated in this pursuit by the fact that the physical world itself is mostly static, and multiview correspondence labels are relatively cheap to collect in static scenes, e.g., by triangulation. We propose to leverage multivie…

    Submitted 3 August, 2020; originally announced August 2020.

  17. arXiv:1910.01210

    cs.CV cs.LG cs.RO

    Embodied Language Grounding with 3D Visual Feature Representations

    Authors: Mihir Prabhudesai, Hsiao-Yu Fish Tung, Syed Ashar Javed, Maximilian Sieb, Adam W. Harley, Katerina Fragkiadaki

    Abstract: We propose associating language utterances to 3D visual abstractions of the scene they describe. The 3D visual abstractions are encoded as 3-dimensional visual feature maps. We infer these 3D visual scene feature maps from RGB images of the scene via view prediction: when the generated 3D scene feature map is neurally projected from a camera viewpoint, it should match the corresponding RGB image.…

    Submitted 17 June, 2021; v1 submitted 2 October, 2019; originally announced October 2019.

    Journal ref: Conference on Computer Vision and Pattern Recognition. 2020, pp. 2220-2229

  18. arXiv:1906.03764

    cs.CV

    Learning from Unlabelled Videos Using Contrastive Predictive Neural 3D Mapping

    Authors: Adam W. Harley, Shrinidhi K. Lakshmikanth, Fangyu Li, Xian Zhou, Hsiao-Yu Fish Tung, Katerina Fragkiadaki

    Abstract: Predictive coding theories suggest that the brain learns by predicting observations at various levels of abstraction. One of the most basic prediction tasks is view prediction: how would a given scene look from an alternative viewpoint? Humans excel at this task. Our ability to imagine and fill in missing information is tightly coupled with perception: we feel as if we see the world in 3 dimension…

    Submitted 16 May, 2020; v1 submitted 9 June, 2019; originally announced June 2019.

  19. arXiv:1901.03628

    cs.CV

    Image Disentanglement and Uncooperative Re-Entanglement for High-Fidelity Image-to-Image Translation

    Authors: Adam W. Harley, Shih-En Wei, Jason Saragih, Katerina Fragkiadaki

    Abstract: Cross-domain image-to-image translation should satisfy two requirements: (1) preserve the information that is common to both domains, and (2) generate convincing images covering variations that appear in the target domain. This is challenging, especially when there are no example translations available as supervision. Adversarial cycle consistency was recently proposed as a solution, with beautifu…

    Submitted 19 October, 2019; v1 submitted 11 January, 2019; originally announced January 2019.

  20. arXiv:1804.10692

    cs.CV cs.RO

    Reward Learning from Narrated Demonstrations

    Authors: Hsiao-Yu Fish Tung, Adam W. Harley, Liang-Kang Huang, Katerina Fragkiadaki

    Abstract: Humans effortlessly "program" one another by communicating goals and desires in natural language. In contrast, humans program robotic behaviours by indicating desired object locations and poses to be achieved, by providing RGB images of goal configurations, or supplying a demonstration to be imitated. None of these methods generalize across environment variations, and they convey the goal in awkwa…

    Submitted 27 April, 2018; originally announced April 2018.

    Comments: The work has been accepted to Conference on Computer Vision and Pattern Recognition (CVPR) 2018

  21. arXiv:1708.04607

    cs.CV

    Segmentation-Aware Convolutional Networks Using Local Attention Masks

    Authors: Adam W. Harley, Konstantinos G. Derpanis, Iasonas Kokkinos

    Abstract: We introduce an approach to integrate segmentation information within a convolutional neural network (CNN). This counter-acts the tendency of CNNs to smooth information across regions and increases their spatial precision. To obtain segmentation information, we set up a CNN to provide an embedding space where region co-membership can be estimated based on Euclidean distance. We use these embedding…

    Submitted 15 August, 2017; originally announced August 2017.

  22. arXiv:1705.11166

    cs.CV

    Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision

    Authors: Hsiao-Yu Fish Tung, Adam W. Harley, William Seto, Katerina Fragkiadaki

    Abstract: Researchers have developed excellent feed-forward models that learn to map images to desired outputs, such as to the images' latent factors, or to other images, using supervised learning. Learning such mappings from unlabelled data, or improving upon supervised models by exploiting unlabelled data, remains elusive. We argue that there are two important parts to learning without annotations: (i) ma…

    Submitted 1 September, 2017; v1 submitted 31 May, 2017; originally announced May 2017.

    Journal ref: The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4354-4362

  23. arXiv:1608.05842

    cs.CV

    Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness

    Authors: Jason J. Yu, Adam W. Harley, Konstantinos G. Derpanis

    Abstract: Recently, convolutional networks (convnets) have proven useful for predicting optical flow. Much of this success is predicated on the availability of large datasets that require expensive and involved data acquisition and laborious labeling. To bypass these challenges, we propose an unsupervised approach (i.e., without leveraging groundtruth flow) to train a convnet end-to-end for predicting o…

    Submitted 20 August, 2016; originally announced August 2016.

  24. arXiv:1511.04377

    cs.CV

    Learning Dense Convolutional Embeddings for Semantic Segmentation

    Authors: Adam W. Harley, Konstantinos G. Derpanis, Iasonas Kokkinos

    Abstract: This paper proposes a new deep convolutional neural network (DCNN) architecture that learns pixel embeddings, such that pairwise distances between the embeddings can be used to infer whether or not the pixels lie on the same region. That is, for any two pixels on the same object, the embeddings are trained to be similar; for any pair that straddles an object boundary, the embeddings are trained to…

    Submitted 7 January, 2016; v1 submitted 13 November, 2015; originally announced November 2015.

  25. arXiv:1502.07058

    cs.CV cs.IR cs.LG cs.NE

    Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval

    Authors: Adam W. Harley, Alex Ufkes, Konstantinos G. Derpanis

    Abstract: This paper presents a new state-of-the-art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs). In object and scene analysis, deep neural nets are capable of learning a hierarchical chain of abstraction from pixel inputs to concise and descriptive representations. The current work explores this capacity in the realm of document analy…

    Submitted 25 February, 2015; originally announced February 2015.