Semester Project
Project Abstract
(Institute of Neuroinformatics, Institute of
Robotics)
Depth Estimation with Neural Networks based on Stereo Vision
by
Pascal Simon and Michel Pescatore
Introduction
The human capability to perceive the world in three
dimensions is largely based on the fact that we have two eyes. Due to their
horizontal displacement, the images projected on the retina are slightly
different: nearer objects show larger disparities than objects further away.
Classical algorithms to compute a depth map from stereo images exist, but they
are usually designed to run as efficiently as possible on computers and
therefore work differently from their biological equivalents. Our goal is to
implement a neural network that solves the same problem but uses algorithms
similar to those of the human brain. In fact, we are not interested in a full
depth image but rather in the 3D coordinates of interesting points; the
algorithm itself decides on which points to focus.
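For the standard parallel-camera pinhole model (not spelled out in this abstract, so the numbers and symbols below are only illustrative), depth follows directly from disparity, focal length and camera baseline, which is why nearer objects show larger disparities. A minimal sketch:

    # Sketch: depth from disparity for two parallel pinhole cameras.
    # Focal length (pixels), baseline (metres) and disparity (pixels) are
    # illustrative values, not parameters of our actual camera setup.
    def depth_from_disparity(focal_px, baseline_m, disparity_px):
        """Return depth in metres; a larger disparity means a nearer point."""
        return focal_px * baseline_m / disparity_px

    print(depth_from_disparity(700.0, 0.12, 20.0))  # ~4.2 m
    print(depth_from_disparity(700.0, 0.12, 60.0))  # ~1.4 m (nearer object)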
Such a program could be used e.g. on a mobile robot
that navigates autonomously and creates a map of important landmarks in its
surroundings. Tracking and following a moving target is another possible
application. Being able to capture the 3D coordinates of points and edges,
single objects might be recognized by their spatial relations. Thus, an object
categorization may also be implemented.
One of the difficulties of the algorithm is
synchronization: depth calculations can only be performed when both eyes
(cameras) focus on the same point in space. The problem is that an object can
show up at quite different positions and orientations in the left and right
images. Illumination may also vary, and furthermore there is no guarantee that
an area found in the left image is also in the field of view of the right camera.
Objectives
The idea of our semester project is to develop an algorithm that does not
detect objects in a general sense but detects various interesting areas or
spots in a natural environment. Interesting areas are not only obvious edges of
physical objects but also "objects" like shadows, spots of intense light
(reflections), and abrupt changes in color or contrast, i.e. everything that
can be detected by "feature detectors" (Fig. 1, middle). The feature detectors
contain designed matrices for detecting features in the visual input that match
a specific structure (e.g. lines, circles, squares, triangles, etc.). The
feature maps show the match between a certain detector and each part of the
image; an example is given in Fig. 2, b, with black denoting a poor match and
white a good one.
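As an illustration, the sketch below convolves an intensity image with one such designed matrix (here a horizontal edge kernel; the kernel values and the SciPy-based implementation are our own assumptions, not the detectors used in the project) and rescales the response into a feature map between 0 (no match) and 1 (good match):

    import numpy as np
    from scipy.signal import convolve2d

    # Hand-designed matrix responding to horizontal edges (illustrative values).
    horizontal_edge = np.array([[-1, -1, -1],
                                [ 0,  0,  0],
                                [ 1,  1,  1]], dtype=float)

    def feature_map(intensity, kernel):
        """Convolve the image with a feature detector and rescale the response
        to [0, 1] (0 = no match / black, 1 = good match / white)."""
        response = np.abs(convolve2d(intensity, kernel, mode="same", boundary="symm"))
        return response / (response.max() + 1e-9)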
The corresponding interesting areas of those feature maps in each visual input
should be able to enhance themselves and suppress the less important areas in a
dynamical way. Finally, all feature maps of a visual input are combined into a
saliency map as their weighted sum followed by an activation function (no
activation below a given threshold). A "neighborhood" function (which enhances
or suppresses values depending on their vicinity) balances the dynamics of the
system, while an implemented "inhibition of return" suppresses values once they
have reached a threshold: such values are reset to a given base value, so that
other interesting areas can be enhanced and detected (Fig. 2, c).
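A minimal sketch of one such saliency update, assuming NumPy feature maps; the weights, thresholds, Gaussian "neighborhood" function and reset value are placeholders chosen only to illustrate the mechanism, not the parameters of our implementation:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def saliency_step(feature_maps, weights, prev_saliency,
                      activation_thr=0.2, ior_thr=0.9, base_value=0.0):
        """One saliency-map update: weighted sum of feature maps, thresholded
        activation, neighborhood interaction, inhibition of return."""
        # Weighted sum of all feature maps of this camera.
        s = sum(w * fm for w, fm in zip(weights, feature_maps))
        # Activation function: no activation below the threshold.
        s[s < activation_thr] = 0.0
        # "Neighborhood" function: feedback from the previous saliency map,
        # enhancing points whose vicinity was already salient.
        s = s + 0.5 * gaussian_filter(prev_saliency, sigma=3)
        # Inhibition of return: values that exceeded the threshold are reset
        # to a base value so that other areas can be enhanced and detected.
        s[s > ior_thr] = base_value
        return s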
Results
Our program captures the visual input stream of
multiple cameras, converts the frames (Fig. 2, a) into the desired format and
splits each of them into a single image per color channel (RGB or HSV)
(Fig. 1, top). For further processing we obtained the best results by using the
V channel (intensity) only. The feature maps are then generated using several
feature detectors (edges with horizontal, vertical and diagonal orientations,
and circles of different sizes) (Fig. 1, middle). At this point there should be
an interaction between the corresponding feature maps of the cameras
(simulating human visual input: left and right eye) to enhance detected edges
that are believed to have the same origin. This function remains to be
implemented. The program's main output is a saliency map for every camera
(Fig. 1, bottom). This saliency map contains the weighted sum of all feature
maps of the corresponding camera. In addition, every point of this saliency map
is enhanced or suppressed by a "neighborhood" function based on the values of
the current saliency map (Fig. 2, c). To speed up the image processing, our
program makes extensive use of the convolution theorem, which allows us to
replace many high-order calculations by Fast Fourier Transforms.
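The convolution theorem states that a convolution in image space becomes an element-wise product in frequency space; the sketch below shows the idea with NumPy's FFT (it computes a circular convolution, and the extra zero-padding a production implementation would use is omitted here for brevity):

    import numpy as np

    def fft_convolve(image, kernel):
        """Convolve image and kernel via the convolution theorem:
        FFT both, multiply element-wise, transform back."""
        h, w = image.shape
        K = np.fft.rfft2(kernel, s=(h, w))   # kernel zero-padded to image size
        I = np.fft.rfft2(image)
        return np.fft.irfft2(I * K, s=(h, w))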
Fig. 2: a) Original image, b) Feature maps (horizontal, diagonal, circles), c) Saliency map
Contacts:
Pascal Simon, D-MAVT
psimon@student.ethz.ch
Michel Pescatore, D-MAVT
michelp@student.ethz.ch
Advisor:
Jorg Conradt
conradt@ini.phys.ethz.ch