arXiv:2002.04479 [pdf, other]

doi 10.1007/978-3-642-33715-4_56

Depth Extraction from Video Using Non-parametric Sampling

Authors: Kevin Karsch, Ce Liu, Sing Bing Kang

Abstract: We describe a technique that automatically generates plausible depth maps from videos using non-parametric depth sampling. We demonstrate our technique in cases where past methods fail (non-translating cameras and dynamic scenes). Our technique is applicable to single images as well as videos. For videos, we use local motion cues to improve the inferred depth maps, while optical flow is used to ensure temporal depth consistency. For training and evaluation, we use a Kinect-based system to collect a large dataset containing stereoscopic videos with known depths. We show that our depth estimation technique outperforms the state-of-the-art on benchmark databases. Our technique can be used to automatically convert a monoscopic video into stereo for 3D visualization, and we demonstrate this through a variety of visually pleasing results for indoor and outdoor scenes, including results from the feature film Charade.

Submitted 24 December, 2019; originally announced February 2020.

Comments: arXiv admin note: text overlap with arXiv:2001.00987

Journal ref: Computer Vision, ECCV 2012, Lecture Notes in Computer Science, vol. 7576, pp. 775-788

arXiv:2001.00987 [pdf, other]

doi 10.1109/TPAMI.2014.2316835

DepthTransfer: Depth Extraction from Video Using Non-parametric Sampling

Authors: Kevin Karsch, Ce Liu, Sing Bing Kang

Abstract: We describe a technique that automatically generates plausible depth maps from videos using non-parametric depth sampling. We demonstrate our technique in cases where past methods fail (non-translating cameras and dynamic scenes). Our technique is applicable to single images as well as videos. For videos, we use local motion cues to improve the inferred depth maps, while optical flow is used to ensure temporal depth consistency. For training and evaluation, we use a Kinect-based system to collect a large dataset containing stereoscopic videos with known depths. We show that our depth estimation technique outperforms the state-of-the-art on benchmark databases. Our technique can be used to automatically convert a monoscopic video into stereo for 3D visualization, and we demonstrate this through a variety of visually pleasing results for indoor and outdoor scenes, including results from the feature film Charade.

Submitted 24 December, 2019; originally announced January 2020.

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence Volume: 36 Issue: 11 pgs 2144-2158 (2014)

arXiv:2001.00986 [pdf, other]

Inverse Rendering Techniques for Physically Grounded Image Editing

Authors: Kevin Karsch

Abstract: From a single picture of a scene, people can typically grasp the spatial layout immediately and even make good guesses at materials properties and where light is coming from to illuminate the scene. For example, we can reliably tell which objects occlude others, what an object is made of and its rough shape, regions that are illuminated or in shadow, and so on. It is interesting how little is known about our ability to make these determinations; as such, we are still not able to robustly "teach" computers to make the same high-level observations as people. This document presents algorithms for understanding intrinsic scene properties from single images. The goal of these inverse rendering techniques is to estimate the configurations of scene elements (geometry, materials, luminaires, camera parameters, etc) using only information visible in an image. Such algorithms have applications in robotics and computer graphics. One such application is in physically grounded image editing: photo editing made easier by leveraging knowledge of the physical space. These applications allow sophisticated editing operations to be performed in a matter of seconds, enabling seamless addition, removal, or relocation of objects in images.

Submitted 24 December, 2019; originally announced January 2020.

Comments: PhD thesis, Computer Science, University of Illinois at Urbana-Champaign, 2015

arXiv:2001.00521 [pdf, other]

Lightform: Procedural Effects for Projected AR

Authors: Brittany Factura, Laura LaPerche, Phil Reyneri, Brett Jones, Kevin Karsch

Abstract: Projected augmented reality, also called projection mapping or video mapping, is a form of augmented reality that uses projected light to directly augment 3D surfaces, as opposed to using pass-through screens or headsets. The value of projected AR is its ability to add a layer of digital content directly onto physical objects or environments in a way that can be instantaneously viewed by multiple people, unencumbered by a screen or additional setup. Because projected AR typically involves projecting onto non-flat, textured objects (especially those that are conventionally not used as projection surfaces), the digital content needs to be mapped and aligned to precisely fit the physical scene to ensure a compelling experience. Current projected AR techniques require extensive calibration at the time of installation, which is not conducive to iteration or change, whether intentional (the scene is reconfigured) or not (the projector is bumped or settles). The workflows are undefined and fragmented, thus making it confusing and difficult for many to approach projected AR. For example, a digital artist may have the software expertise to create AR content, but could not complete an installation without experience in mounting, blending, and realigning projector(s); the converse is true for many A/V installation teams/professionals. Projection mapping has therefore been limited to high-end event productions, concerts, and films, because it requires expensive, complex tools, and skilled teams ($100K+ budgets).
Lightform provides a technology that makes projected AR approachable, practical, intelligent, and robust through integrated hardware and computer-vision software. Lightform brings together and unites a currently fragmented workflow into a single cohesive process that provides users with an approachable and robust method to create and control projected AR experiences.

Submitted 24 December, 2019; originally announced January 2020.

arXiv:1912.12297 [pdf, other]

Automatic Scene Inference for 3D Object Compositing

Authors: Kevin Karsch, Kalyan Sunkavalli, Sunil Hadap, Nathan Carr, Hailin Jin, Rafael Fonte, Michael Sittig

Abstract: We present a user-friendly image editing system that supports a drag-and-drop object insertion (where the user merely drags objects into the image, and the system automatically places them in 3D and relights them appropriately), post-process illumination editing, and depth-of-field manipulation. Underlying our system is a fully automatic technique for recovering a comprehensive 3D scene model (geometry, illumination, diffuse albedo and camera parameters) from a single, low dynamic range photograph. This is made possible by two novel contributions: an illumination inference algorithm that recovers a full lighting model of the scene (including light sources that are not directly visible in the photograph), and a depth estimation algorithm that combines data-driven depth transfer with geometric reasoning about the scene layout. A user study shows that our system produces perceptually convincing results, and achieves the same level of realism as techniques that require significant user interaction.