-
Objaverse-XL: A Universe of 10M+ 3D Objects
Authors:
Matt Deitke,
Ruoshi Liu,
Matthew Wallingford,
Huong Ngo,
Oscar Michel,
Aditya Kusupati,
Alan Fan,
Christian Laforte,
Vikram Voleti,
Samir Yitzhak Gadre,
Eli VanderBilt,
Aniruddha Kembhavi,
Carl Vondrick,
Georgia Gkioxari,
Kiana Ehsani,
Ludwig Schmidt,
Ali Farhadi
Abstract:
Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.
Submitted 11 July, 2023;
originally announced July 2023.
-
Objaverse: A Universe of Annotated 3D Objects
Authors:
Matt Deitke,
Dustin Schwenk,
Jordi Salvador,
Luca Weihs,
Oscar Michel,
Eli VanderBilt,
Ludwig Schmidt,
Kiana Ehsani,
Aniruddha Kembhavi,
Ali Farhadi
Abstract:
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present-day 3D repositories in terms of scale, number of categories, and the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.
Submitted 15 December, 2022;
originally announced December 2022.
-
Phone2Proc: Bringing Robust Robots Into Our Chaotic World
Authors:
Matt Deitke,
Rose Hendrix,
Luca Weihs,
Ali Farhadi,
Kiana Ehsani,
Aniruddha Kembhavi
Abstract:
Training embodied agents in simulation has become mainstream for the embodied AI community. However, these agents often struggle when deployed in the physical world due to their inability to generalize to real-world environments. In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment. The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan, while also sampling lighting, clutter, surface textures, and instances of smaller objects with randomized placement and materials. Leveraging just a simple RGB camera, training with Phone2Proc shows massive improvements from 34.7% to 70.7% success rate in sim-to-real ObjectNav performance across a test suite of over 200 trials in diverse real-world environments, including homes, offices, and RoboTHOR. Furthermore, Phone2Proc's diverse distribution of generated scenes makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter.
Submitted 8 December, 2022;
originally announced December 2022.
-
Retrospectives on the Embodied AI Workshop
Authors:
Matt Deitke,
Dhruv Batra,
Yonatan Bisk,
Tommaso Campari,
Angel X. Chang,
Devendra Singh Chaplot,
Changan Chen,
Claudia Pérez D'Arpino,
Kiana Ehsani,
Ali Farhadi,
Li Fei-Fei,
Anthony Francis,
Chuang Gan,
Kristen Grauman,
David Hall,
Winson Han,
Unnat Jain,
Aniruddha Kembhavi,
Jacob Krantz,
Stefan Lee,
Chengshu Li,
Sagnik Majumder,
Oleksandr Maksymets,
Roberto Martín-Martín,
Roozbeh Mottaghi, et al. (14 additional authors not shown)
Abstract:
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
Submitted 4 December, 2022; v1 submitted 13 October, 2022;
originally announced October 2022.
-
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
Authors:
Matt Deitke,
Eli VanderBilt,
Alvaro Herrasti,
Luca Weihs,
Jordi Salvador,
Kiana Ehsani,
Winson Han,
Eric Kolve,
Ali Farhadi,
Aniruddha Kembhavi,
Roozbeh Mottaghi
Abstract:
Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose ProcTHOR, a framework for procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks. We demonstrate the power and potential of ProcTHOR via a sample of 10,000 generated houses and a simple neural model. Models trained using only RGB images on ProcTHOR, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation, including the presently running Habitat 2022, AI2-THOR Rearrangement 2022, and RoboTHOR challenges. We also demonstrate strong 0-shot results on these benchmarks, via pre-training on ProcTHOR with no fine-tuning on the downstream benchmark, often beating previous state-of-the-art systems that access the downstream training data.
Submitted 14 June, 2022;
originally announced June 2022.
-
Visual Room Rearrangement
Authors:
Luca Weihs,
Matt Deitke,
Aniruddha Kembhavi,
Roozbeh Mottaghi
Abstract:
There has been significant recent progress in the field of Embodied AI, with researchers developing models and algorithms enabling embodied agents to navigate and interact within completely unseen environments. In this paper, we propose a new dataset and baseline models for the task of Rearrangement. We particularly focus on the task of Room Rearrangement: an agent begins by exploring a room and recording objects' initial configurations. We then remove the agent and change the poses and states (e.g., open/closed) of some objects in the room. The agent must restore the initial configurations of all objects in the room. Our dataset, named RoomR, includes 6,000 distinct rearrangement settings involving 72 different object types in 120 scenes. Our experiments show that solving this challenging interactive task, which involves navigation and object interaction, is beyond the capabilities of the current state-of-the-art techniques for embodied tasks, and we are still very far from achieving perfect performance on these types of tasks. The code and the dataset are available at: https://ai2thor.allenai.org/rearrangement
Submitted 30 March, 2021;
originally announced March 2021.
-
RoboTHOR: An Open Simulation-to-Real Embodied AI Platform
Authors:
Matt Deitke,
Winson Han,
Alvaro Herrasti,
Aniruddha Kembhavi,
Eric Kolve,
Roozbeh Mottaghi,
Jordi Salvador,
Dustin Schwenk,
Eli VanderBilt,
Matthew Wallingford,
Luca Weihs,
Mark Yatskar,
Ali Farhadi
Abstract:
Visual recognition ecosystems (e.g. ImageNet, Pascal, COCO) have undeniably played a prevailing role in the evolution of modern computer vision. We argue that interactive and embodied visual AI has reached a stage of development similar to visual recognition prior to the advent of these ecosystems. Recently, various synthetic environments have been introduced to facilitate research in embodied AI. Notwithstanding this progress, the crucial question of how well models trained in simulation generalize to reality has remained largely unanswered. The creation of a comparable ecosystem for simulation-to-real embodied AI presents many challenges: (1) the inherently interactive nature of the problem, (2) the need for tight alignments between real and simulated worlds, (3) the difficulty of replicating physical conditions for repeatable experiments, and (4) the associated cost. In this paper, we introduce RoboTHOR to democratize research in interactive and embodied visual AI. RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world. As a first benchmark, our experiments show there exists a significant gap between the performance of models trained in simulation when they are tested in both simulations and their carefully constructed physical analogs. We hope that RoboTHOR will spur the next stage of evolution in embodied computer vision. RoboTHOR can be accessed at the following link: https://ai2thor.allenai.org/robothor
Submitted 14 April, 2020;
originally announced April 2020.
-
AI2-THOR: An Interactive 3D Environment for Visual AI
Authors:
Eric Kolve,
Roozbeh Mottaghi,
Winson Han,
Eli VanderBilt,
Luca Weihs,
Alvaro Herrasti,
Matt Deitke,
Kiana Ehsani,
Daniel Gordon,
Yuke Zhu,
Aniruddha Kembhavi,
Abhinav Gupta,
Ali Farhadi
Abstract:
We introduce The House Of inteRactions (THOR), a framework for visual AI research, available at http://ai2thor.allenai.org. AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks. AI2-THOR enables research in many different domains including but not limited to deep reinforcement learning, imitation learning, learning by interaction, planning, visual question answering, unsupervised representation learning, object detection and segmentation, and learning models of cognition. The goal of AI2-THOR is to facilitate building visually intelligent models and push the research forward in this domain.
Submitted 26 August, 2022; v1 submitted 14 December, 2017;
originally announced December 2017.