The ability to determine 3D information about the scene, called depth sensing, is a valuable tool for developers and users alike. Depth sensing is a very active area of computer vision research, with recent innovations ranging from applications like portrait mode and AR to fundamental sensing innovations like transparent object detection. Typical RGB-based stereo depth sensing techniques can be computationally expensive, suffer in regions with low texture, and fail completely in extreme low-light conditions.
Because the Face Unlock feature on Pixel 4 must work at high speed and in darkness, it called for a different approach. To this end, the front of the Pixel 4 contains a real-time infrared (IR) active stereo depth sensor, called uDepth. A key computer vision capability on the Pixel 4, this technology helps the authentication system identify the user while also protecting against spoof attacks. It also supports a variety of novel capabilities, like after-the-fact photo retouching, depth-based segmentation of a scene, background blur, portrait effects and 3D photos.
Recently, we provided access to uDepth as an API on Camera2, using the Pixel Neural Core, two IR cameras, and an IR pattern projector to supply time-synchronized depth frames (in DEPTH16) at 30Hz. The Google Camera App uses this API to bring improved depth capabilities to selfies taken on the Pixel 4. In this post, we explain broadly how uDepth works, elaborate on the underlying algorithms, and discuss applications with example results for the Pixel 4.
Overview of Stereo Depth Sensing
All stereo camera systems reconstruct depth using parallax. To observe this effect, look at an object, close one eye, then switch which eye is closed. The apparent position of the object will shift, with closer objects appearing to move more. uDepth is part of the family of dense local stereo matching techniques, which estimate parallax computationally for each pixel. These techniques evaluate a region surrounding each pixel in the image formed by one camera, and try to find a similar region in the corresponding image from the second camera. When calibrated properly, the reconstructions generated are metric, meaning that they express real physical distances.
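The relationship between parallax and distance can be made concrete with the standard rectified-stereo depth equation, Z = f·B/d. The sketch below is illustrative only; the focal length and baseline are made-up values, not the Pixel 4's actual camera parameters.

```python
# Minimal sketch: converting stereo disparity to metric depth for a
# calibrated, rectified stereo pair. Z = focal_length * baseline / disparity.
# The numbers below are illustrative, not the Pixel 4's real calibration.

def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Metric depth (meters) from disparity (pixels)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Closer objects shift more between views, i.e. have larger disparity.
near = disparity_to_depth(40.0, focal_length_px=800.0, baseline_m=0.02)  # ≈ 0.4 m
far = disparity_to_depth(8.0, focal_length_px=800.0, baseline_m=0.02)    # ≈ 2.0 m
```

Note the inverse relationship: halving the disparity doubles the estimated depth, which is why depth precision degrades with distance.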
To deal with textureless regions and cope with low-light conditions, we make use of an “active stereo” setup, which projects an IR pattern into the scene that is detected by the stereo IR cameras. This approach makes low-texture regions easier to identify, improving results and reducing the computational requirements of the system.
What Makes uDepth Distinct?
Stereo sensing systems can be extremely computationally intensive, and it is critical that a sensor running at 30Hz is low power while remaining high quality. uDepth leverages a number of key insights to accomplish this.
One such insight is that given a pair of regions that are similar to each other, corresponding subsets of those regions are also similar. For example, given two 8×8 patches of pixels that are similar, it is very likely that the top-left 4×4 sub-region of each member of the pair is also similar. This informs the uDepth pipeline’s initialization procedure, which builds a pyramid of depth proposals by comparing non-overlapping tiles in each image and selecting those that are most similar. This process starts with 1×1 tiles, and accumulates support hierarchically until an initial low-resolution depth map is generated.
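The hierarchical accumulation can be sketched as follows. This is a simplified illustration under assumed details: a sum-of-absolute-differences (SAD) matching cost, a small hand-picked disparity range, and 2×2 tile merging are our choices for the sketch, not uDepth's actual parameters.

```python
import numpy as np

# Sketch of hierarchical tile initialization: matching costs computed for
# small tiles are summed to score larger parent tiles, exploiting the fact
# that similar regions have similar sub-regions. SAD cost, the disparity
# range, and 2x2 merging are illustrative assumptions.

def tile_costs(left, right, disparities, tile=1):
    """Per-tile SAD cost volume over candidate disparities."""
    h, w = left.shape
    th, tw = h // tile, w // tile
    costs = np.full((len(disparities), th, tw), np.inf)
    for di, d in enumerate(disparities):
        if d >= w:
            continue
        diff = np.abs(left[:, d:].astype(float) - right[:, : w - d].astype(float))
        valid_w = ((w - d) // tile) * tile
        trimmed = diff[: th * tile, :valid_w]
        # Sum absolute differences within each non-overlapping tile.
        summed = trimmed.reshape(th, tile, -1, tile).sum(axis=(1, 3))
        costs[di, :, : summed.shape[1]] = summed
    return costs

def aggregate(costs):
    """Merge 2x2 blocks of child-tile costs into parent-tile costs."""
    d, th, tw = costs.shape
    return costs.reshape(d, th // 2, 2, tw // 2, 2).sum(axis=(2, 4))
```

Starting from per-pixel (1×1) costs and repeatedly calling `aggregate`, the best-scoring disparity per tile at each level forms the pyramid of depth proposals described above.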
After initialization, we apply a novel technique for neural depth refinement to support the regular grid pattern illuminator on the Pixel 4. Typical active stereo systems project a pseudo-random grid pattern to help disambiguate matches in the scene, but uDepth is capable of supporting repeating grid patterns as well. Repeating structure in such patterns produces regions that look similar across stereo pairs, which can lead to incorrect matches. We mitigate this issue using a lightweight (75k parameter) convolutional architecture, which uses IR brightness and neighbor information to adjust incorrect matches, in less than 1.5ms per frame.
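The actual refinement step is a learned 75k-parameter CNN, which we cannot reproduce here. As a rough stand-in, the sketch below illustrates the kind of correction it performs: a depth estimate that disagrees strongly with its neighbors, as a repeated-pattern mismatch typically does, is replaced by a locally consistent value. The window size and threshold are arbitrary choices for the sketch.

```python
import numpy as np

# Illustrative stand-in for neural depth refinement (NOT the actual 75k-param
# CNN): detect depth estimates inconsistent with their 3x3 neighborhood,
# which is how grid-pattern mismatches often manifest, and replace them with
# the local median. Threshold and window size are assumptions.

def reject_outlier_matches(depth, threshold=0.1):
    h, w = depth.shape
    out = depth.copy()
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = depth[y - 1 : y + 2, x - 1 : x + 2]
            med = np.median(window)
            if abs(depth[y, x] - med) > threshold:
                out[y, x] = med  # likely mismatch: snap to neighbor consensus
    return out
```

The learned network plays an analogous role but conditions on IR brightness as well as neighboring depths, letting it keep genuine depth discontinuities that a blind filter like this would smooth away.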
Following neural depth refinement, good depth estimates are iteratively propagated from neighboring tiles. This and the following pipeline steps leverage another insight key to the success of uDepth: natural scenes are typically locally planar with only small nonplanar deviations. This permits us to find planar tiles that cover the scene and only later refine individual depths for each pixel in a tile, greatly reducing computational load.
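The "locally planar" assumption can be illustrated with a small sketch: fit a plane z = ax + by + c to a tile's sparse depth samples, then evaluate it densely for every pixel in the tile. The least-squares fit and tile parameterization here are our illustration, not uDepth's actual plane representation.

```python
import numpy as np

# Sketch of the locally-planar idea: one cheap plane fit per tile stands in
# for many per-pixel depth estimates; per-pixel depths are then read off the
# plane. Fitting method and coordinates are illustrative assumptions.

def fit_tile_plane(xs, ys, zs):
    """Least-squares plane z = a*x + b*y + c through sample points."""
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    coeffs, *_ = np.linalg.lstsq(A, zs, rcond=None)
    return coeffs  # (a, b, c)

def densify_tile(coeffs, tile_h, tile_w):
    """Per-pixel depth for a tile from its plane hypothesis."""
    a, b, c = coeffs
    ys, xs = np.mgrid[0:tile_h, 0:tile_w]
    return a * xs + b * ys + c
```

Because each tile is summarized by just three plane coefficients, propagating and comparing hypotheses between neighboring tiles is far cheaper than re-matching every pixel.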
Finally, the best match from among neighboring plane hypotheses is selected, with subpixel refinement and invalidation if no good match could be found.
When a phone experiences a severe drop, the factory calibration of the stereo cameras can diverge from the actual position of the cameras. To ensure high-quality results during real-world use, the uDepth system is self-calibrating. A scoring routine evaluates every depth image for signs of miscalibration, and builds up confidence in the state of the device. If miscalibration is detected, calibration parameters are regenerated from the current scene. This follows a pipeline consisting of feature detection and correspondence, subpixel refinement (taking advantage of the dot profile), and bundle adjustment.
Depth for Computational Photography
The data from the uDepth sensor is designed to be accurate and metric, which is a fundamental requirement for Face Unlock. Computational photography applications like portrait mode and 3D photos have very different needs. In these use cases, it is not critical to achieve video frame rates, but the depth should be smooth, edge-aligned and complete in the whole field-of-view of the color camera.
To achieve this we trained an end-to-end deep learning architecture that enhances the raw uDepth data, inferring a complete, dense 3D depth map. We use a combination of RGB images, people segmentation, and raw depth, with a dropout scheme forcing use of information for each of the inputs.
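The input-dropout idea can be sketched as follows: during training, each modality is independently zeroed with some probability, so the network cannot come to rely on any single input. The function name, modality names, and drop probability are assumptions for illustration, not the trained model's actual configuration.

```python
import numpy as np

# Sketch of an input-dropout scheme for a multi-modality network: each of
# RGB, people segmentation, and raw depth is independently dropped (zeroed)
# with probability p_drop during training. Names and the probability are
# illustrative assumptions.

def drop_inputs(rgb, segmentation, raw_depth, rng, p_drop=0.3):
    """Zero out each input modality independently with probability p_drop."""
    inputs = {"rgb": rgb, "segmentation": segmentation, "raw_depth": raw_depth}
    out = {}
    for name, tensor in inputs.items():
        keep = rng.random() >= p_drop
        out[name] = tensor if keep else np.zeros_like(tensor)
    return out
```

Training against randomly missing modalities encourages the network to produce sensible depth even when, say, raw depth is sparse or segmentation is unavailable at inference time.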
To acquire ground truth, we leveraged a volumetric capture system that can produce near-photorealistic models of people using a geodesic sphere outfitted with 331 custom color LED lights, an array of high-resolution cameras, and a set of custom high-resolution depth sensors. We added Pixel 4 phones to the setup and synchronized them with the rest of the hardware (lights and cameras). The generated training data consists of a mixture of real images as well as synthetic renderings from the Pixel 4 camera viewpoint.
Finally, we are happy to provide a demo application for you to play with that visualizes a real-time point cloud from uDepth; download it here (this app is for demonstration and research purposes only and not intended for commercial use; Google will not provide any support or updates). This demo app visualizes 3D point clouds from your Pixel 4 device. Because the depth maps are time-synchronized and in the same frame of reference as the RGB images, a textured view of the 3D scene can be shown, as in the example visualization below: