Computer Science > Artificial Intelligence

arXiv:2301.12507 (cs)

[Submitted on 29 Jan 2023 (v1), last revised 14 Jun 2023 (this version, v2)]

Title: Distilling Internet-Scale Vision-Language Models into Embodied Agents

Authors: Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, Ishita Dasgupta

Abstract: Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent's behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Few-shot prompting lets us teach abstract category membership, including pre-existing categories (food vs. toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
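
The hindsight relabeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` container, the `relabel_with_vlm` helper, the dictionary-based frames, and the `stub_vlm` function are all hypothetical stand-ins introduced here for illustration. A real system would pass image observations to a pretrained VLM; the stub only mimics how changing the prompt changes which description (object name or color) is attached to the same behavior.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    """One episode of agent behavior (hypothetical container)."""
    frames: List[Dict[str, str]]  # stand-in for image observations
    actions: List[int]
    instruction: str = ""         # filled in retroactively


def relabel_with_vlm(
    trajectories: List[Trajectory],
    vlm: Callable[[Dict[str, str], str], str],
    prompt: str,
) -> List[Trajectory]:
    """Attach a VLM-generated description to each trajectory in hindsight."""
    relabeled = []
    for traj in trajectories:
        # Assumption: the final frame shows what the agent ended up doing.
        description = vlm(traj.frames[-1], prompt)
        relabeled.append(Trajectory(traj.frames, traj.actions, description))
    return relabeled


def stub_vlm(frame: Dict[str, str], prompt: str) -> str:
    """Toy stand-in for a VLM: answers from frame metadata, not pixels."""
    if "color" in prompt.lower():
        return frame["color"]
    return frame["name"]


# Two episodes in which the agent happened to pick up different objects.
episodes = [
    Trajectory(frames=[{"name": "plane", "color": "red"}], actions=[0, 1]),
    Trajectory(frames=[{"name": "car", "color": "blue"}], actions=[1, 1]),
]

# The prompt controls the supervision signal: names versus features.
by_name = relabel_with_vlm(episodes, stub_vlm, "What is this object?")
by_color = relabel_with_vlm(episodes, stub_vlm, "What color is this object?")

print([t.instruction for t in by_name])
print([t.instruction for t in by_color])
```

The relabeled trajectories pair each behavior with language describing it, so they could then serve as supervised training data for an instruction-conditioned agent, for example via imitation learning. The training step itself is omitted here.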

Comments:	9 pages, 7 figures. Presented at ICML 2023
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2301.12507 [cs.AI]
	(or arXiv:2301.12507v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2301.12507

Submission history

From: Theodore Sumers
[v1] Sun, 29 Jan 2023 18:21:05 UTC (7,043 KB)
[v2] Wed, 14 Jun 2023 14:04:50 UTC (7,055 KB)
