Semester Project
Project Abstract
(Institute of Neuroinformatics, Institute of
Robotics)
Depth Estimation with Neural Networks based on Stereo Vision
by
Pascal Simon and Michel Pescatore
Introduction
The human capability to perceive the world in three
dimensions is largely based on the fact that we have two eyes. Due to their
horizontal displacement, the images projected on the retina are slightly
different: nearer objects show larger disparities than objects further away.
Classical algorithms to compute a depth map from stereo images exist, but they
are usually designed to run as efficiently as possible on computers and
therefore work differently from their biological equivalents. Our goal is to
implement a neural network that solves the same problem but uses algorithms
similar to those of the human brain. In fact, we are not interested in a full
depth image but rather in the 3D coordinates of interesting points; the
algorithm itself decides on which points to focus.
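For the standard parallel-camera pinhole model (not spelled out in this abstract, so the numbers and symbols below are only illustrative), depth follows directly from disparity, focal length and camera baseline, which is why nearer objects show larger disparities. A minimal sketch:

    # Sketch: depth from disparity for two parallel pinhole cameras.
    # Focal length (pixels), baseline (metres) and disparity (pixels) are
    # illustrative values, not parameters of our actual camera setup.
    def depth_from_disparity(focal_px, baseline_m, disparity_px):
        """Return depth in metres; a larger disparity means a nearer point."""
        return focal_px * baseline_m / disparity_px

    print(depth_from_disparity(700.0, 0.12, 20.0))  # ~4.2 m
    print(depth_from_disparity(700.0, 0.12, 60.0))  # ~1.4 m (nearer object)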
Such a program could be used e.g. on a mobile robot
that navigates autonomously and creates a map of important landmarks in its
surroundings. Tracking and following a moving target is another possible
application. Being able to capture the 3D coordinates of points and edges,
single objects might be recognized by their spatial relations. Thus, an object
categorization may also be implemented.
One of the difficulties of the algorithm is
synchronization: depth calculations can only be performed when both eyes
(cameras) focus on the same point in space. The problem is that an object can
show up at quite different positions and orientations in the left and right
images. Illumination may also vary, and furthermore there is no guarantee that
an area found in the left image is also in the field of view of the right camera.
Objectives
The idea of our semester project is to develop an algorithm that does not
detect objects in a general sense but detects various interesting areas or
spots in a natural environment. Interesting areas are not only obvious edges of
physical objects but also "objects" like shadows, spots of intense light
(reflections), and abrupt changes in color or contrast, i.e. everything that
can be detected by "feature detectors" (Fig. 1, middle). The feature detectors
contain designed matrices for detecting features in the visual input that match
a specific structure (e.g. lines, circles, squares, triangles, etc.). The
feature maps show the match between a certain detector and each part of the
image; an example is given in Fig. 2, b, with black denoting a poor match and
white a good one.
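As an illustration, the sketch below convolves an intensity image with one such designed matrix (here a horizontal edge kernel; the kernel values and the SciPy-based implementation are our own assumptions, not the detectors used in the project) and rescales the response into a feature map between 0 (no match) and 1 (good match):

    import numpy as np
    from scipy.signal import convolve2d

    # Hand-designed matrix responding to horizontal edges (illustrative values).
    horizontal_edge = np.array([[-1, -1, -1],
                                [ 0,  0,  0],
                                [ 1,  1,  1]], dtype=float)

    def feature_map(intensity, kernel):
        """Convolve the image with a feature detector and rescale the response
        to [0, 1] (0 = no match / black, 1 = good match / white)."""
        response = np.abs(convolve2d(intensity, kernel, mode="same", boundary="symm"))
        return response / (response.max() + 1e-9)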
The corresponding interesting areas of those feature maps in each visual input
should be able to enhance themselves and suppress the less important areas in a
dynamical way. Finally, all feature maps of a visual input are combined into a
saliency map as their weighted sum followed by an activation function (no
activation below a given threshold). A "neighborhood" function (which enhances
or suppresses values depending on their vicinity) balances the dynamics of the
system, while an implemented "inhibition of return" suppresses values once they
have reached a threshold: such values are reset to a given base value, so that
other interesting areas can be enhanced and detected (Fig. 2, c).
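A minimal sketch of one such saliency update, assuming NumPy feature maps; the weights, thresholds, Gaussian "neighborhood" function and reset value are placeholders chosen only to illustrate the mechanism, not the parameters of our implementation:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def saliency_step(feature_maps, weights, prev_saliency,
                      activation_thr=0.2, ior_thr=0.9, base_value=0.0):
        """One saliency-map update: weighted sum of feature maps, thresholded
        activation, neighborhood interaction, inhibition of return."""
        # Weighted sum of all feature maps of this camera.
        s = sum(w * fm for w, fm in zip(weights, feature_maps))
        # Activation function: no activation below the threshold.
        s[s < activation_thr] = 0.0
        # "Neighborhood" function: feedback from the previous saliency map,
        # enhancing points whose vicinity was already salient.
        s = s + 0.5 * gaussian_filter(prev_saliency, sigma=3)
        # Inhibition of return: values that exceeded the threshold are reset
        # to a base value so that other areas can be enhanced and detected.
        s[s > ior_thr] = base_value
        return s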
Results
Our program captures the visual input stream of
multiple cameras, converts the frames (Fig. 2, a) into the desired format and
splits each of them into a single image per color channel (RGB or HSV)
(Fig. 1, top). For further processing we obtained the best results by using the
V channel (intensity) only. The feature maps are then generated using several
feature detectors (edges with horizontal, vertical and diagonal orientations,
and circles of different sizes) (Fig. 1, middle). At this point there should be
an interaction between the corresponding feature maps of the cameras
(simulating human visual input: left and right eye) to enhance detected edges
that are believed to have the same origin. This function remains to be
implemented. The program's main output is a saliency map for every camera
(Fig. 1, bottom). This saliency map contains the weighted sum of all feature
maps of the corresponding camera. In addition, every point of this saliency map
is enhanced or suppressed by a "neighborhood" function based on the values of
the current saliency map (Fig. 2, c). To speed up the image processing, our
program makes extensive use of the convolution theorem, which allows us to
replace many high-order calculations by Fast Fourier Transforms.
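The convolution theorem states that a convolution in image space becomes an element-wise product in frequency space; the sketch below shows the idea with NumPy's FFT (it computes a circular convolution, and the extra zero-padding a production implementation would use is omitted here for brevity):

    import numpy as np

    def fft_convolve(image, kernel):
        """Convolve image and kernel via the convolution theorem:
        FFT both, multiply element-wise, transform back."""
        h, w = image.shape
        K = np.fft.rfft2(kernel, s=(h, w))   # kernel zero-padded to image size
        I = np.fft.rfft2(image)
        return np.fft.irfft2(I * K, s=(h, w))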
Fig. 2: a) Original image, b) Feature maps (horizontal, diagonal, circles), c) Saliency map
Contacts:
Pascal Simon, D-MAVT
psimon@student.ethz.ch
Michel Pescatore, D-MAVT
michelp@student.ethz.ch
Advisor:
Jorg Conradt
conradt@ini.phys.ethz.ch