Showing 1–32 of 32 results for author: Sermanet, P

  1. arXiv:2403.12943  [pdf, other]

    cs.RO cs.AI

    Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

    Authors: Vidhi Jain, Maria Attarian, Nikhil J Joshi, Ayzaan Wahid, Danny Driess, Quan Vuong, Pannag R Sanketi, Pierre Sermanet, Stefan Welker, Christine Chan, Igor Gilitschenski, Yonatan Bisk, Debidatta Dwibedi

    Abstract: While large-scale robotic systems typically rely on textual instructions for tasks, this work explores a different approach: can robots infer the task directly from observing humans? This shift necessitates the robot's ability to decode human intent and translate it into executable actions within its physical constraints and environment. We introduce Vid2Robot, a novel end-to-end video-based learn…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Robot learning: Imitation Learning, Robot Perception, Sensing & Vision, Grasping & Manipulation

  2. arXiv:2403.01823  [pdf, other]

    cs.RO cs.AI

    RT-H: Action Hierarchies Using Language

    Authors: Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, Dorsa Sadigh

    Abstract: Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., "pick coke can" and "pick an apple") in…

    Submitted 31 May, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  3. arXiv:2401.12963  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

    Authors: Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Ted Xiao, et al. (3 additional authors not shown)

    Abstract: Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the d…

    Submitted 1 July, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

    Comments: 26 pages, 9 figures, ICRA 2024 VLMNM Workshop

  4. arXiv:2311.00899  [pdf, other]

    cs.RO

    RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

    Authors: Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, Yuan Cao

    Abstract: We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple robot and human embodiment…

    Submitted 1 November, 2023; originally announced November 2023.

  5. arXiv:2310.10625  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Video Language Planning

    Authors: Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, Jonathan Tompson

    Abstract: We are interested in enabling visual planning for complex long-horizon tasks in the space of generated videos and language, leveraging recent advances in large generative models pretrained on Internet-scale data. To this end, we present video language planning (VLP), an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: https://video-language-planning.github.io/

  6. arXiv:2310.08864  [pdf, other]

    cs.RO

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (267 additional authors not shown)

    Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method…

    Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: Project website: https://robotics-transformer-x.github.io

  7. Robotic Table Tennis: A Case Study into a High Speed Learning System

    Authors: David B. D'Ambrosio, Jonathan Abelian, Saminda Abeyruwan, Michael Ahn, Alex Bewley, Justin Boyd, Krzysztof Choromanski, Omar Cortes, Erwin Coumans, Tianli Ding, Wenbo Gao, Laura Graesser, Atil Iscen, Navdeep Jaitly, Deepali Jain, Juhana Kangaspunta, Satoshi Kataoka, Gus Kouretas, Yuheng Kuang, Nevena Lazic, Corey Lynch, Reza Mahjourian, Sherry Q. Moore, Thinh Nguyen, Ken Oslund, et al. (10 additional authors not shown)

    Abstract: We present a deep-dive into a real-world robotic learning system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real w…

    Submitted 6 September, 2023; originally announced September 2023.

    Comments: Published and presented at Robotics: Science and Systems (RSS2023)

  8. arXiv:2307.15818  [pdf, other]

    cs.RO cs.CL cs.CV cs.LG

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Authors: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, et al. (29 additional authors not shown)

    Abstract: We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web.…

    Submitted 28 July, 2023; originally announced July 2023.

    Comments: Website: https://robotics-transformer.github.io/

  9. arXiv:2303.03378  [pdf, other]

    cs.LG cs.AI cs.RO

    PaLM-E: An Embodied Multimodal Language Model

    Authors: Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence

    Abstract: Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model ar…

    Submitted 6 March, 2023; originally announced March 2023.

  10. arXiv:2211.11736  [pdf, other]

    cs.RO cs.AI cs.LG

    Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models

    Authors: Ted Xiao, Harris Chan, Pierre Sermanet, Ayzaan Wahid, Anthony Brohan, Karol Hausman, Sergey Levine, Jonathan Tompson

    Abstract: In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Such methods typically learn from corpora of robot-language data that was either collected with specific tasks in mind or expensively re-labelled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models (VLMs) like…

    Submitted 1 July, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

    Comments: Published as a conference paper at RSS 2023

  11. arXiv:2210.03662  [pdf, other]

    cs.RO

    GoalsEye: Learning High Speed Precision Table Tennis on a Physical Robot

    Authors: Tianli Ding, Laura Graesser, Saminda Abeyruwan, David B. D'Ambrosio, Anish Shankar, Pierre Sermanet, Pannag R. Sanketi, Corey Lynch

    Abstract: Learning goal conditioned control in the real world is a challenging open problem in robotics. Reinforcement learning systems have the potential to learn autonomously via trial-and-error, but in practice the costs of manual reward design, ensuring safe exploration, and hyperparameter tuning are often enough to preclude real world deployment. Imitation learning approaches, on the other hand, offer…

    Submitted 13 October, 2022; v1 submitted 7 October, 2022; originally announced October 2022.

  12. arXiv:2207.05608  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Authors: Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, Brian Ichter

    Abstract: Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to…

    Submitted 12 July, 2022; originally announced July 2022.

    Comments: Project website: https://innermonologue.github.io

  13. arXiv:2205.06333  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations

    Authors: Negin Heravi, Ayzaan Wahid, Corey Lynch, Pete Florence, Travis Armstrong, Jonathan Tompson, Pierre Sermanet, Jeannette Bohg, Debidatta Dwibedi

    Abstract: Perceptual understanding of the scene and the relationship between its different components is important for successful completion of robotic tasks. Representation learning has been shown to be a powerful technique for this, but most of the current methodologies learn task specific representations that do not necessarily transfer well to other tasks. Furthermore, representations learned by supervi…

    Submitted 12 March, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

  14. arXiv:2204.01691  [pdf, other]

    cs.RO cs.CL cs.LG

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Authors: Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, et al. (20 additional authors not shown)

    Abstract: Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embo…

    Submitted 16 August, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: See website at https://say-can.github.io/ V1. Initial Upload. V2. Added PaLM results. Added study about new capabilities (drawer manipulation, chain of thought prompting, multilingual instructions). Added an ablation study of language model size. Added an open-source version of SayCan on a simulated tabletop environment. Improved readability

  15. arXiv:2104.14548  [pdf, other]

    cs.CV

    With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations

    Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

    Abstract: Self-supervised learning algorithms based on instance discrimination train encoders to be invariant to pre-defined transformations of the same instance. While most methods treat different views of the same image as positives for a contrastive loss, we are interested in using positives from other instances in the dataset. Our method, Nearest-Neighbor Contrastive Learning of visual Representations (…

    Submitted 7 October, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

    Comments: Accepted at ICCV 2021

  16. arXiv:2010.06491  [pdf, other]

    cs.RO cs.LG

    Broadly-Exploring, Local-Policy Trees for Long-Horizon Task Planning

    Authors: Brian Ichter, Pierre Sermanet, Corey Lynch

    Abstract: Long-horizon planning in realistic environments requires the ability to reason over sequential tasks in high-dimensional state spaces with complex dynamics. Classical motion planning algorithms, such as rapidly-exploring random trees, are capable of efficiently exploring large state spaces and computing long-horizon, sequential plans. However, these algorithms are generally challenged with complex…

    Submitted 13 October, 2020; originally announced October 2020.

  17. arXiv:2006.15418  [pdf, other]

    cs.CV

    Counting Out Time: Class Agnostic Video Repetition Counting in the Wild

    Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

    Abstract: We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in constraining the period prediction module to use temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen repetitions in videos in the wild. We train this model, called Repnet, with a synthetic dataset that is generated fro…

    Submitted 27 June, 2020; originally announced June 2020.

    Comments: Accepted at CVPR 2020. Project webpage: https://sites.google.com/view/repnet

  18. arXiv:2006.06874  [pdf, other]

    cs.RO cs.AI cs.LG eess.SY

    Learning to Play by Imitating Humans

    Authors: Rostam Dinyari, Pierre Sermanet, Corey Lynch

    Abstract: Acquiring multiple skills has commonly involved collecting a large number of expert demonstrations per task or engineering custom reward functions. Recently it has been shown that it is possible to acquire a diverse set of skills by self-supervising control on top of human teleoperated play data. Play is rich in state space coverage and a policy trained on this data can generalize to specific task…

    Submitted 11 June, 2020; originally announced June 2020.

  19. arXiv:2006.00545  [pdf, other]

    cs.RO cs.CV

    Motion2Vec: Semi-Supervised Representation Learning from Surgical Videos

    Authors: Ajay Kumar Tanwani, Pierre Sermanet, Andy Yan, Raghav Anand, Mariano Phielipp, Ken Goldberg

    Abstract: Learning meaningful visual representations in an embedding space can facilitate generalization in downstream tasks such as action segmentation and imitation. In this paper, we learn a motion-centric representation of surgical video demonstrations by grouping them into action segments/sub-goals/options in a semi-supervised manner. We present Motion2Vec, an algorithm that learns a deep embedding fea…

    Submitted 31 May, 2020; originally announced June 2020.

    Comments: IEEE International Conference on Robotics and Automation (ICRA), 2020

  20. arXiv:2005.07648  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV

    Language Conditioned Imitation Learning over Unstructured Data

    Authors: Corey Lynch, Pierre Sermanet

    Abstract: Natural language is perhaps the most flexible and intuitive way for humans to communicate tasks to a robot. Prior work in imitation learning typically requires each task be specified with a task id or goal image -- something that is often impractical in open-world environments. On the other hand, previous approaches in instruction following allow agent behavior to be guided by language, but typica…

    Submitted 7 July, 2021; v1 submitted 15 May, 2020; originally announced May 2020.

    Comments: Published at RSS 2021

  21. arXiv:1906.04312  [pdf, other]

    cs.CV cs.LG cs.RO

    Online Object Representations with Contrastive Learning

    Authors: Sören Pirk, Mohi Khansari, Yunfei Bai, Corey Lynch, Pierre Sermanet

    Abstract: We propose a self-supervised approach for learning representations of objects from monocular videos and demonstrate it is particularly useful in situated settings such as robotics. The main contributions of this paper are: 1) a self-supervising objective trained with contrastive learning that can discover and disentangle object attributes from video without using any labels; 2) we leverage object…

    Submitted 10 June, 2019; originally announced June 2019.

    Comments: 10 pages

  22. arXiv:1904.07846  [pdf, other]

    cs.CV cs.LG

    Temporal Cycle-Consistency Learning

    Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

    Abstract: We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle consistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using the ne…

    Submitted 16 April, 2019; originally announced April 2019.

    Comments: Accepted at CVPR 2019. Project webpage: https://sites.google.com/view/temporal-cycle-consistency

  23. arXiv:1903.11780  [pdf, other]

    cs.LG stat.ML

    Wasserstein Dependency Measure for Representation Learning

    Authors: Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, Pierre Sermanet

    Abstract: Mutual information maximization has emerged as a powerful learning objective for unsupervised representation learning obtaining state-of-the-art performance in applications such as object recognition, speech recognition, and reinforcement learning. However, such approaches are fundamentally limited since a tight lower bound of mutual information requires sample size exponential in the mutual infor…

    Submitted 27 March, 2019; originally announced March 2019.

  24. arXiv:1903.01973  [pdf, other]

    cs.RO

    Learning Latent Plans from Play

    Authors: Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, Pierre Sermanet

    Abstract: Acquiring a diverse repertoire of general-purpose skills remains an open challenge for robotics. In this work, we propose self-supervising control on top of human teleoperated play data as a way to scale up skill learning. Play has two properties that make it attractive compared to conventional task demonstrations. Play is cheap, as it can be collected in large quantities quickly without task segm…

    Submitted 20 December, 2019; v1 submitted 5 March, 2019; originally announced March 2019.

    Comments: Published at CoRL 2019 (3rd Conference on Robot Learning, Osaka, Japan)

  25. arXiv:1808.00928  [pdf, other]

    cs.CV cs.LG cs.RO

    Learning Actionable Representations from Visual Observations

    Authors: Debidatta Dwibedi, Jonathan Tompson, Corey Lynch, Pierre Sermanet

    Abstract: In this work we explore a new approach for robots to teach themselves about the world simply by observing it. In particular we investigate the effectiveness of learning task-agnostic representations for continuous control tasks. We extend Time-Contrastive Networks (TCN) that learn from visual observations by embedding multiple frames jointly in the embedding space as opposed to a single frame. We…

    Submitted 2 February, 2019; v1 submitted 2 August, 2018; originally announced August 2018.

    Comments: This work is accepted in IROS 2018. Project website: https://sites.google.com/view/actionablerepresentations

  26. arXiv:1704.06888  [pdf, other]

    cs.CV cs.RO

    Time-Contrastive Networks: Self-Supervised Learning from Video

    Authors: Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine

    Abstract: We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captu…

    Submitted 19 March, 2018; v1 submitted 23 April, 2017; originally announced April 2017.

  27. arXiv:1612.06699  [pdf, other]

    cs.CV cs.RO

    Unsupervised Perceptual Rewards for Imitation Learning

    Authors: Pierre Sermanet, Kelvin Xu, Sergey Levine

    Abstract: Reward function design and exploration time are arguably the biggest obstacles to the deployment of reinforcement learning (RL) agents in the real world. In many real-world tasks, designing a reward function takes considerable hand engineering and often requires additional sensors to be installed just to measure whether the task has been executed successfully. Furthermore, many interesting tasks c…

    Submitted 12 June, 2017; v1 submitted 20 December, 2016; originally announced December 2016.

  28. arXiv:1412.7054  [pdf, other]

    cs.CV cs.LG cs.NE

    Attention for Fine-Grained Categorization

    Authors: Pierre Sermanet, Andrea Frome, Esteban Real

    Abstract: This paper presents experiments extending the work of Ba et al. (2014) on recurrent neural models for attention into less constrained visual environments, specifically fine-grained categorization on the Stanford Dogs data set. In this work we use an RNN of the same structure but substitute a more powerful visual network and perform large-scale pre-training of the visual network outside of the atte…

    Submitted 10 April, 2015; v1 submitted 22 December, 2014; originally announced December 2014.

    Comments: ICLR 2015 Workshop

  29. arXiv:1409.4842  [pdf, other]

    cs.CV

    Going Deeper with Convolutions

    Authors: Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich

    Abstract: We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully c…

    Submitted 16 September, 2014; originally announced September 2014.

  30. arXiv:1312.6229  [pdf, ps, other]

    cs.CV

    OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

    Authors: Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun

    Abstract: We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to incr…

    Submitted 23 February, 2014; v1 submitted 21 December, 2013; originally announced December 2013.

  31. arXiv:1212.0142  [pdf, ps, other]

    cs.CV cs.LG

    Pedestrian Detection with Unsupervised Multi-Stage Feature Learning

    Authors: Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, Yann LeCun

    Abstract: Pedestrian detection is a problem of considerable practical interest. Adding to the list of successful applications of deep learning methods to vision, we report state-of-the-art and competitive results on all major pedestrian datasets with a convolutional network model. The model uses a few new twists, such as multi-stage features, connections that skip layers to integrate global shape informatio…

    Submitted 2 April, 2013; v1 submitted 1 December, 2012; originally announced December 2012.

    Comments: 12 pages

  32. arXiv:1204.3968  [pdf, ps, other]

    cs.CV cs.LG cs.NE

    Convolutional Neural Networks Applied to House Numbers Digit Classification

    Authors: Pierre Sermanet, Soumith Chintala, Yann LeCun

    Abstract: We classify digits of real-world house numbers using convolutional neural networks (ConvNets). ConvNets are hierarchical feature learning neural networks whose structure is biologically inspired. Unlike many popular vision approaches that are hand-designed, ConvNets can automatically learn a unique set of features optimized for a given task. We augmented the traditional ConvNet architecture by lea…

    Submitted 17 April, 2012; originally announced April 2012.

    Comments: 4 pages, 6 figures, 2 tables