We present a method for extracting depth information from a rectified image
pair. Our approach focuses on the first stage of many stereo algorithms: the
matching cost computation. We address it by learning a similarity
measure on small image patches using a convolutional neural network. Training
is carried out in a supervised manner by constructing a binary classification
data set with examples of similar and dissimilar pairs of patches.
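As a rough illustration of how such a data set might be assembled, the sketch below builds one similar and one dissimilar pair for a single pixel with known disparity. The helper names, the patch half-width, and the negative offset are hypothetical choices made for this example, not values stated here.

```python
import numpy as np

def extract_patch(img, y, x, half):
    """Return the square patch of side 2*half+1 centered at (y, x)."""
    return img[y - half:y + half + 1, x - half:x + half + 1]

def make_pair_examples(left, right, y, x, d, half=4, neg_offset=6):
    """Build one similar (label 1) and one dissimilar (label 0) patch pair.

    Assumes a rectified pair, so the match of left pixel (y, x) with
    disparity d lies at (y, x - d) in the right image. `half` and
    `neg_offset` are illustrative values, not taken from the text.
    """
    p_left = extract_patch(left, y, x, half)
    p_pos = extract_patch(right, y, x - d, half)               # true match
    p_neg = extract_patch(right, y, x - d + neg_offset, half)  # shifted away
    return [(p_left, p_pos, 1), (p_left, p_neg, 0)]
```

The caller is expected to choose pixels far enough from the image border that both patches fit inside the image.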
We examine two network architectures for learning a similarity measure on image
patches. The first architecture is faster than the second, but produces
disparity maps that are slightly less accurate. In both cases, the input to the
network is a pair of small image patches and the output is a measure of
similarity between them. Both architectures contain a trainable feature
extractor that represents each image patch with a feature vector. The
similarity between patches is measured on the feature vectors instead of the
raw image intensity values. The fast architecture uses a fixed similarity
measure to compare the two feature vectors, while the accurate architecture
attempts to learn a good similarity measure on feature vectors.
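The difference between the two comparison stages can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity stands in for the fixed measure, a single hidden layer with a sigmoid output stands in for the learned measure, and the weight arguments `w1`, `b1`, `w2`, `b2` are hypothetical placeholders for trained parameters. The feature vectors are assumed to come from the shared feature extractor.

```python
import numpy as np

def cosine_similarity(f_left, f_right, eps=1e-8):
    """Fixed similarity (assumed here): cosine of the angle between features."""
    num = float(np.dot(f_left, f_right))
    den = float(np.linalg.norm(f_left) * np.linalg.norm(f_right)) + eps
    return num / den

def learned_similarity(f_left, f_right, w1, b1, w2, b2):
    """Learned similarity (sketch): small fully connected net on both features."""
    x = np.concatenate([f_left, f_right])   # joint input of the two vectors
    h = np.maximum(0.0, w1 @ x + b1)        # ReLU hidden layer
    logit = float(w2 @ h + b2)
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid, score in [0, 1]
```

One practical consequence of a fixed measure is that the feature vectors can be computed once per image position and then compared cheaply for every candidate disparity, whereas a learned comparison has to be evaluated again for each pair.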
The output of the convolutional neural network is used to initialize the stereo
matching cost. A series of post-processing steps follows: cross-based cost
aggregation, semiglobal matching, a left-right consistency check, subpixel
enhancement, a median filter, and a bilateral filter.
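To make one of these steps concrete, a left-right consistency check can be sketched as below. This is a generic formulation, not necessarily the exact variant used here: the function name and the tolerance `max_diff` are assumptions, and both disparity maps are assumed to be dense arrays of the same shape.

```python
import numpy as np

def left_right_consistency(disp_left, disp_right, max_diff=1.0):
    """Return a boolean mask of left-image pixels whose disparity is consistent.

    A left pixel (y, x) with disparity d maps to (y, x - d) in the right
    image. It is kept if that position lies inside the image and the right
    disparity there differs from d by at most `max_diff` (assumed tolerance).
    """
    h, w = disp_left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_right = xs - np.rint(disp_left).astype(int)   # matching column on the right
    in_bounds = (x_right >= 0) & (x_right < w)
    x_clamped = np.clip(x_right, 0, w - 1)          # safe index for out-of-bounds
    diff = np.abs(disp_left - disp_right[ys, x_clamped])
    return in_bounds & (diff <= max_diff)
```

Pixels that fail the check are typically occlusions or mismatches, and a later step would fill them in from neighboring consistent pixels.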
We evaluate our method on the KITTI 2012, KITTI 2015, and Middlebury stereo
data sets and show that it outperforms other approaches on all three data sets.