
Scene Structure Classification as Preprocessing for Feature-Based Visual Odometry.


Real-time image processing has been enabled in recent years by advances in embedded hardware and open-source software libraries. Autonomous vehicles and advanced driver assistance systems rely on a suite of sensors that typically includes cameras, since cameras enable object recognition and tracking as well as lane keeping, localization, and path planning in GPS-denied environments [1, 2, 3, 4, 5, 6, 7]. A key requirement for many modern automotive applications is accurate motion estimation, and visual odometry (VO) is one of the best methods to achieve this. VO is the process of estimating the position and orientation of an agent (e.g., a vehicle) using only the input of a single camera or multiple attached cameras [8].

VO approaches are classified as either monocular VO when one camera is used or stereo VO when more than one is used. By relying on only a single camera for its operation, monocular VO is inherently a more difficult problem. One of the major problems in monocular VO is scale ambiguity [9], which means that the motion trajectory can be estimated only up to an unknown scale factor. On the other hand, by relying on the principles of stereo vision, stereo VO avoids such problems and is capable of achieving superior results. For example, with a known baseline distance between cameras, stereo VO is able to estimate the motion trajectory at true scale and thus does not suffer from the scale ambiguity problem. In this article, we present an overview of our stereo VO algorithm and investigate how content-based preprocessing affects the algorithm's performance. The problem of self-localization and motion estimation of an agent using only visual input was first addressed in the early 1980s with the seminal work of Moravec [10]; however, it was Nister et al. who coined the term visual odometry in their landmark paper [11]. A review of the field was published by Scaramuzza and Fraundorfer [8].

This article expands on our previous work on stereo VO [2, 12, 13]. We investigate the effect of structural image content in the video frames on the number of detected vs. tracked VO features for common algorithms. The goal is to identify regions in the frame sequence where many corner points are lost and to modify the VO algorithm so that it does not process them. The classification preprocessing step, described in the next section, segments the video frame into tiles, each of which is labeled as belonging to the Random, Texture, or Transient class.

Previous research has considered scene content in the context of VO. A common problem is accounting for moving objects in order to correctly track stationary feature points from the environment and produce motion estimates relative to the environment. For example, Chang et al. implement imminent collision detection using stereo vision via algorithms for object detection in depth images and for estimating relative vehicle travel velocity [4]. Another work implements a driver assist solution that analyzes traffic using a stereo camera system to extract 3D information of surrounding vehicles, and its performance is compared to that of other environment sensory systems [14]. In other research, a simultaneous localization and mapping (SLAM) implementation for dynamic road environments works by extracting a disparity map and computing optical flow in a stereo camera configuration [15]. The disparity change, i.e., the estimated flow field between stereo views, is then computed, followed by the definition of regions of interest containing moving objects. These approaches exemplify region-of-interest masking based on motion in the scene. Another common approach to scene analysis is object recognition. For example, another work implements a stereo vision algorithm that utilizes object knowledge to classify objects such as cars, non-textured surfaces, and reflecting surfaces [5]. The approach uses a nonlocal regularizer and is based on a sparse disparity estimate and a semantic segmentation of the image. In addition, detecting streets in urban settings is achieved by modeling the facades of buildings as planar surfaces and estimating their parameters based on a dense disparity map [6]. Buildings are detected, leading to an estimation of street location. Stereo vision systems are also used in driver assist solutions in urban environments to estimate the locations of the road, obstacles, and other vehicles [7, 16].
Our approach differs in classifying images based on strictly structural scene content, judging by the presence of edges, periodicities, and flat areas.

Other notable related works are the machine learning algorithms used within corner detectors, such as the features from accelerated segment test (FAST) algorithm [23]. Typically, the machine learning is applied to corner points based on features in their surrounding patches. In this article, we explore applying machine learning to image regions at a higher level, where corners may have high-quality features yet be ambiguous with respect to many surrounding points depending on the scene (on trees, for example), which hurts ego-motion estimation and VO.

Scene Structure Classification

An image content classifier by structural content was used to segment the image into tiles of 64 x 64 pixels and classify these as either Random, Texture, or Transient regions [17]. The classifier has an estimation error of 13.3% and is based on calculation of six features for each 64 x 64 pixel image tile. These features are as follows:

* The variance of the RMS-normalized intensity of pixels in the grayscale video frame

* The variance of the fourth Dom energies over 6 Fan directions in the cortex filter [18]

* Dynamic range of the third and fourth Dom energies over 6 Fan directions in the cortex filter [18]

* Dynamic range of 6 Fan energy slopes in the cortex filter [18]

* Maximum of baseband filtered tile optical density histogram

The cortex filter decomposes an image, or image tile, into radial spatial frequency bands according to edge orientation. It mimics how the human visual system detects structure in seen images. Figure 1 shows the filter decomposition bands with an example of the regions with energy in the third Dom across all six orientations. The linear classifier features use metrics derived from the cortex filter decomposition to differentiate between the three classes [17].

An example of applying the cortex filter to image tiles from the three classes is shown in Figure 2. The decomposition differs qualitatively across the classes. For the Random-class tile (at the top), the energy is relatively low and uniform outside of the baseband. The Texture-class image (middle row) has a similar uniform energy distribution with, however, more low-frequency content. This means that Fan filter energies exhibit a sharper roll-off for the Texture compared to the Random case. For the Transient-class image region (bottom row), the energy is directional and normal to the edges with notable high-frequency content. The directionality of energy distribution can be measured by the variation in Dom energies in this case.

Tiles classified as Transient contain edges of high dark-light contrast in various directions: horizontal, vertical, and diagonal. Examples include text of a road sign and a shadow line of a light building wall. Image regions classified as Texture contain spatial periodicities, i.e., regular patterns, such as the boards of a wooden fence or the lines of parked cars. Finally, tiles of the Random class contain flat areas, such as the sky, and areas that do not have sharp edges or periodicities. This classifier is applied as a preprocessing step to the video frames captured by the onboard camera.

After classification, the VO algorithm proceeds to calculate corner points in each class and track them over the video frames. From the successfully tracked corners, it is possible to estimate the movement of the camera, as we do in our VO algorithm. Several feature extractors are common in the literature, including adaptive and generic accelerated segment test (AGAST) [19], SIFT [21], SURF [22], FAST [23], ORB [24], BRISK [25], and Shi-Tomasi [26]. Matching between extracted corners can be performed by means of feature descriptors. A feature descriptor computed for each corner acts as a signature for that corner. SIFT, SURF, ORB, and BRISK extract both features and their descriptors. On the other hand, AGAST, FAST, and Shi-Tomasi are only corner detectors. Thus, we pair these three detectors with binary robust independent elementary features (BRIEF) [20] descriptors. BRIEF descriptors are chosen because they are relatively fast to compute and because they are binary feature descriptors, i.e., each descriptor is a binary string, which can be matched quickly using the Hamming distance. The Hamming distance is defined as the number of positions at which the corresponding bits are different. We evaluate the effect of structural scene content on each of these feature extraction methods.
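
The Hamming-distance matching described above can be illustrated with a minimal sketch. It assumes each binary descriptor is stored as a Python `bytes` object of equal length; the function names and the `max_distance` threshold are hypothetical and are not taken from the VO implementation.

```python
from typing import List, Sequence, Tuple


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions at which two equal-length descriptors differ."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    # XOR leaves a 1 wherever the bits differ; count those 1s byte by byte.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def match_descriptors(
    query: Sequence[bytes],
    train: Sequence[bytes],
    max_distance: int,  # hypothetical acceptance threshold, in bits
) -> List[Tuple[int, int, int]]:
    """Brute-force nearest-neighbor matching.

    Returns (query_index, train_index, distance) for every query descriptor
    whose closest train descriptor is within max_distance bits.
    """
    matches = []
    for qi, q in enumerate(query):
        best = None
        for ti, t in enumerate(train):
            d = hamming_distance(q, t)
            if d <= max_distance and (best is None or d < best[1]):
                best = (ti, d)
        if best is not None:
            matches.append((qi, best[0], best[1]))
    return matches
```

For example, the bytes `0xB2` (10110010) and `0x9A` (10011010) differ in two bit positions, so their Hamming distance is 2.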

Figure 3 shows an example frame from the KITTI dataset with detected and tracked AGAST corners. It shows that many of the detected corners are not tracked as the video runs. Figure 4 shows how the scene was classified into the three classes, Random, Texture, and Transient. We investigate whether the number of tracked corners varies as a function of scene structural content, i.e., class.

Algorithm Description

The VO component in this article is treated as a black box and is described in our prior work [12]. The algorithm, named Lightweight Visual Tracking (LVT), is available in open-source form and is compatible with the Robot Operating System (ROS). At a high level, our algorithm triangulates 3D map points by finding correspondences between corners extracted from the two stereo images. These map points are then tracked over time for as long as possible in subsequent frames, until they leave the field of view or are no longer found (fail to match). Traditionally, VO systems relied on estimating the motion only between consecutive frames, i.e., from frame to frame. Only recently has the full history of the tracked corner points been utilized for VO [28], where the authors utilize optical flow in tracking corner points across frames and then use the full corner point history in motion estimation. In our VO algorithm, we also make use of the full history of corners in motion estimation; however, we rely on tracking a sparse local 3D map. That is, as long as a 3D point can be successfully matched and associated with a detected 2D corner feature point in the current frame, it is kept in the map to be used for future estimations; otherwise, the 3D map point is culled from the local map. Successful correspondences between detected corner points in each frame and the 3D map are used in estimating the six-degrees-of-freedom pose (position and orientation) by solving an optimization problem where the objective is to find the pose that minimizes the image re-projection error, as shown in Equation 1:

$$\{R, T\} = \underset{R,\,T}{\arg\min} \sum_{i \in S} \rho\left( \left\| x^{i} - \pi\left( R X^{i} + T \right) \right\|^{2} \right)$$ Eq. (1)

where $x^{i} \in \mathbb{R}^{2}$ are detected image corners, $X^{i} \in \mathbb{R}^{3}$ are world 3D points, $S$ is the set of all matches, $\rho$ is the Cauchy cost function, $\pi$ is the projection function, $R \in SO(3)$ is the orientation, and $T \in \mathbb{R}^{3}$ is the position. This is solved iteratively using the Levenberg-Marquardt algorithm.
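
To make the objective concrete, the following sketch evaluates a robust re-projection cost for a given pose. It assumes the cost takes the common form of a sum, over all matches, of the Cauchy cost applied to the squared pixel distance between a detected corner and the projection of its transformed 3D point. The pinhole projection model, the particular Cauchy form rho(s) = c^2 log(1 + s / c^2), the `scale` parameter, and all function names are assumptions for illustration; the optimizer itself (Levenberg-Marquardt) is not shown.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def transform(R: Sequence[Sequence[float]], T: Vec3, X: Vec3) -> Vec3:
    """Compute R X + T for a 3x3 rotation matrix R (list of rows)."""
    return tuple(sum(R[r][c] * X[c] for c in range(3)) + T[r] for r in range(3))


def project(p: Vec3, fx: float, fy: float, cx: float, cy: float) -> Vec2:
    """Assumed pinhole projection: camera-frame 3D point to pixel coordinates."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def cauchy(squared_error: float, scale: float = 1.0) -> float:
    """Assumed Cauchy robust cost: c^2 * log(1 + s / c^2)."""
    c2 = scale * scale
    return c2 * math.log1p(squared_error / c2)


def reprojection_cost(R, T, matches, intrinsics, scale: float = 1.0) -> float:
    """Sum of robustified squared re-projection errors over all matches.

    matches    : iterable of (x_obs, X_world) pairs, i.e. a 2D corner and
                 the 3D map point it was matched to
    intrinsics : (fx, fy, cx, cy), hypothetical camera parameters
    """
    fx, fy, cx, cy = intrinsics
    total = 0.0
    for x_obs, X_world in matches:
        u, v = project(transform(R, T, X_world), fx, fy, cx, cy)
        squared = (x_obs[0] - u) ** 2 + (x_obs[1] - v) ** 2
        total += cauchy(squared, scale)
    return total
```

The robust cost grows only logarithmically for large residuals, which limits the influence of gross mismatches on the estimated pose compared with a plain sum of squares.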

We introduce a modification to our previously developed VO algorithm [2, 12]. The modification is a preprocessing step: after the video frame image is loaded, it is divided into tiles of 64 x 64 pixels, which are in turn classified by structural content as either Random, Texture, or Transient image content [17]. Figure 5 shows the modified flow diagram of the VO algorithm.
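
The tiling and exclusion logic of this preprocessing step can be sketched as follows. The classifier itself is not reproduced here; the sketch assumes its output is available as a dictionary mapping tile indices to class names. The handling of partial tiles at the frame border, the function names, and the data layout are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple

TILE = 64  # tile edge length in pixels, as stated in the text


def tile_grid(width: int, height: int, tile: int = TILE) -> List[Tuple[int, int]]:
    """All (row, col) tile indices covering a frame.

    Partial tiles at the right and bottom borders are included here;
    how the original implementation treats them is not specified.
    """
    rows = -(-height // tile)  # ceiling division
    cols = -(-width // tile)
    return [(r, c) for r in range(rows) for c in range(cols)]


def tile_index(x: float, y: float, tile: int = TILE) -> Tuple[int, int]:
    """(row, col) of the tile that contains pixel (x, y)."""
    return (int(y) // tile, int(x) // tile)


def drop_corners_in_class(
    corners: Iterable[Tuple[float, float]],
    tile_labels: Dict[Tuple[int, int], str],
    excluded: str,
    tile: int = TILE,
) -> List[Tuple[float, float]]:
    """Keep only the corners whose tile is not labeled with the excluded class."""
    return [
        (x, y)
        for (x, y) in corners
        if tile_labels.get(tile_index(x, y, tile)) != excluded
    ]
```

A usage example: with `tile_labels = {(0, 0): "Random", (0, 1): "Transient"}`, calling `drop_corners_in_class([(10, 10), (70, 10)], tile_labels, "Transient")` keeps only the corner at (10, 10).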

Effect of Content Class

In order to investigate which content class is most beneficial for tracking corner points, a test sequence of 24 seconds was chosen. The video sequence consists of the first 240 frames of the KITTI 00 sequence in grayscale [27]. Our VO algorithm utilizes a transient local 3D map for tracking. It consists of a sparse set of 3D points, i.e., features that are used for tracking and, therefore, motion estimation. This local map is internal to the system and is not an attempt to build a global map of the environment. Detected feature points that are useful for tracking are considered tracked points and remain in this local map [12].

Figure 6 shows that the most detected AGAST corners are in the Random-class tiles. Also, the Transient-class tiles have the fewest detected corners, but the number does not fluctuate as much over the frames compared to the Random and Texture content. Similarly, Figure 7 shows the number of tracked corner points, which are used to compute the VO estimates. We note that the Random class again has many tracked features and again the Transient class has the least fluctuation. The ratio of tracked to detected feature points is outlined in Table 1. Although the Random class extracts and tracks many AGAST features, the ratio typically falls short relative to the other two classes.
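
The tracked-to-detected ratio for one content class can be computed from per-frame counts as in the sketch below. It assumes the ratio is evaluated frame by frame and then summarized by its mean and standard deviation; skipping frames with no detections and using the population standard deviation are assumptions, since the exact procedure is not stated.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def tracked_ratio_stats(
    detected_per_frame: Sequence[int],
    tracked_per_frame: Sequence[int],
) -> Tuple[float, float]:
    """Mean and standard deviation of tracked/detected, in percent.

    Frames with zero detected corners are skipped (assumption) to avoid
    division by zero.
    """
    ratios = [
        100.0 * tracked / detected
        for detected, tracked in zip(detected_per_frame, tracked_per_frame)
        if detected > 0
    ]
    if not ratios:
        raise ValueError("no frame has any detected corners")
    return mean(ratios), pstdev(ratios)
```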

An evaluation of various known feature point extraction algorithms is presented in Table 1. Here, the number of tracked feature points as a percentage of detected feature points is presented along with the standard deviation. The values are collected per class of scene content. The number of detected features is limited to 2000. The strongest features that are also well spread out across the image are selected using a technique known as adaptive non-maximal suppression, as described by Brown et al. [29].
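
Adaptive non-maximal suppression can be sketched as follows: each keypoint receives a suppression radius equal to its distance to the nearest keypoint that is significantly stronger, and the keypoints with the largest radii are kept. This is a simple quadratic-time illustration, not the implementation used in the VO code; the keypoint layout `(x, y, response)` and the robustness factor default of 0.9 are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, response); assumed layout


def anms(
    keypoints: Sequence[Keypoint],
    num_to_keep: int,
    c_robust: float = 0.9,  # assumed robustness factor
) -> List[Keypoint]:
    """Select keypoints that are both strong and spatially spread out."""
    radii_sq = []
    for i, (xi, yi, ri) in enumerate(keypoints):
        radius_sq = math.inf  # the globally strongest point is never suppressed
        for j, (xj, yj, rj) in enumerate(keypoints):
            # Only significantly stronger neighbors can suppress keypoint i.
            if i != j and ri < c_robust * rj:
                radius_sq = min(radius_sq, (xi - xj) ** 2 + (yi - yj) ** 2)
        radii_sq.append(radius_sq)
    # Largest suppression radius first; ties keep their original order.
    order = sorted(range(len(keypoints)), key=lambda i: radii_sq[i], reverse=True)
    return [keypoints[i] for i in order[:num_to_keep]]
```

Note that a weaker but isolated keypoint can be preferred over a stronger keypoint that sits right next to an even stronger one, which is what spreads the selection across the image.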

The evaluation shows that not all feature point extractors perform equally well in the regions of scene content classes. Transient-class content has the highest ratio of tracked to detected features for all feature detectors, so features detected in the Transient class are consistently tracked better. Corners tracked in the Random class, on the other hand, have slightly less variation, i.e., a lower standard deviation, for most detectors. BRISK and ORB produced relatively low tracking ratios; however, they showed the highest precision, i.e., the lowest standard deviation, across image content classes.

In Table 2, the accuracy of each feature extraction method used with our VO is reported. The translation error metric Et was introduced with the KITTI dataset [27] and is defined as the average translation error over all test subsequences of length 100, 200, ..., 800 meters. It can be seen in Table 2 that the accuracy of VO is fairly consistent among the feature extraction methods. We also note that although SIFT and SURF had a higher percentage of tracked corners, as shown in Table 1, this did not reduce the VO translation error relative to the other algorithms.


To evaluate the impact of excluding some parts of the video frames, the KITTI sequence 00 was used [27]. The KITTI dataset includes a stereo grayscale image sequence alongside ground truth that includes LIDAR data and high-accuracy GPS position measurements. The VO algorithm was run to estimate the location of the vehicle in four configurations: using the whole frame image as a baseline, excluding Random-class tiles, excluding Texture-class tiles, and excluding Transient-class tiles. Figure 8 shows how well the path was estimated for sequence 00 using the whole frame image, without excluding any regions, using the algorithm described previously [2, 12]. It can be seen that some road sections are not tracked very well. Figures 9 through 11 show the path estimates when the algorithm is modified to exclude tiles from the image that were classified as Random, Texture, and Transient, respectively. Qualitatively, no major difference is evident for this testing sequence, which is of a daytime drive in clear weather in a feature-rich environment. As discussed in the previous section, the classifier could be used to improve the proportion of inliers, i.e., the ratio of tracked to detected features, which can be useful in more challenging scenarios where few features are available to begin with and improving the inlier count can reduce the risk of tracking loss. Examples include fast driving and rain.

A more quantitative evaluation is shown in Table 3, where the translation error is computed for the baseline case with all image content included, as well as for the cases when structural content (from three classes) is excluded in the preprocessing step. We see that excluding the Random tiles increases VO translation drift error, which is likely because the Random class contributes many feature points (Figures 6 and 7). Error increases more so when excluding the Texture tiles, although they are not as many; we conclude that they are particularly useful and reliable for VO. On the other hand, excluding the Transient tiles reduces the error, which means that in this scenario, corners detected on Transient tiles are of lower quality and were hurting VO.

More insight into the effect of removing the content of one of the classes as a preprocessing step in the VO algorithm is given in Figures 12 and 13. Figure 12 shows the beginning frame of a challenging scene we identified, where the feature-rich building facades disappear behind dense trees and the vehicle enters a heavily shadowed road segment. For Figure 13, we introduce a new parameter, the percentage of inlier points at the VO motion estimation step. This parameter is found at a later stage in our algorithm than the quantities discussed so far in Figures 6 and 7 and Table 1, which concern corner extraction and descriptor-based tracking, that is, the number of 3D map points that are successfully matched to detected 2D image features. These data associations are fed to an optimization routine that solves for the pose as shown in Equation 1. The number of inliers is found after this optimization is performed and the pose is computed. It measures how many of the tracked features are indeed good matches and how many are outliers. We calculate the percentage of inliers as the number of points that fit the motion estimation model divided by the number of tracked features. It can be considered a measure of quality: if the data is dominated by outliers, which can occur in challenging scenes, motion estimation can fail or degrade significantly.
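
The inlier percentage can be illustrated with a short sketch. The text does not state how a point is judged to fit the motion model; the sketch assumes a point counts as an inlier when its re-projection error under the estimated pose does not exceed a pixel threshold. The function name and the `threshold_px` parameter are hypothetical.

```python
from typing import Sequence


def inlier_percentage(reprojection_errors: Sequence[float], threshold_px: float) -> float:
    """Percentage of tracked features that fit the estimated pose.

    reprojection_errors : one pixel error per tracked feature, measured
                          after the pose has been computed
    threshold_px        : assumed inlier threshold in pixels
    """
    if not reprojection_errors:
        return 0.0  # no tracked features, so no inliers
    inliers = sum(1 for e in reprojection_errors if e <= threshold_px)
    return 100.0 * inliers / len(reprojection_errors)
```

For instance, with errors of 0.5, 1.0, 3.0, and 10.0 pixels and a threshold of 2.0 pixels, two of the four tracked features are inliers, giving 50%.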

Figure 13 shows the percentage of tracked corners that are inliers in each case, and we note that the exclusion of feature points from the Random class improves the inlier percentage on average in this scene. Also, exclusion of feature points from the Transient class reduces the inlier percentage, which agrees with our prior observations in Table 1, which shows that the highest ratio of tracked to detected features is found in the Transient-class tiles.


A preprocessing step is added to remove regions of Random, Texture, or Transient content from the video frame image as the vehicle travels along a path. The evaluations show that image regions with textural content result in the best-quality corners for VO. Random-class regions result in the greatest number of corners but with moderate quality. Transient-class regions exhibit an interesting effect: they typically produce high-quality descriptors that track more successfully than features detected in the other two classes. Nevertheless, excluding them from motion estimation does not reduce VO accuracy, since the other two classes contain enough useful tracked features in the daytime scenario used. Image content affects machine vision algorithms of passenger vehicles. We explored the case of a feature-based VO algorithm; the discussion can carry over to feature-based algorithms in general, such as image registration, localization, and mapping. In our experiments, we leveraged a classifier developed in prior work that demonstrated improvements at various steps in our algorithm. We expect that a classifier or machine learning-driven corner selector that is trained and tuned on an automotive dataset specifically for the purpose of VO would exhibit more significant results. The opportunity lies in pushing the boundary of operation of algorithms into more challenging conditions, in terms of scene content, lighting, or weather, by using machine learning to pre-select reliable feature points before they are passed on to later stages of processing that might fail when the feature point quality and inlier count are low.


[1.] Rawashdeh, N.A. and Jasim, H.T., "Multi-Sensor Input Path Planning for an Autonomous Ground Vehicle," in 9th International Symposium on Mechatronics and Its Applications (ISMA), 2013, IEEE, 1-6.

[2.] Aladem, M., Rawashdeh, S., and Rawashdeh, N., "Evaluation of a Stereo Visual Odometry Algorithm for Passenger Vehicle Navigation," SAE Technical Paper 2017-01-0046, 2017, doi:10.4271/2017-01-0046.

[3.] Hane, C., Sattler, T., and Pollefeys, M., "Obstacle Detection for Self-Driving Cars Using Only Monocular Cameras and Wheel Odometry," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, IEEE, 5101-5108.

[4.] Chang, P., Camus, T., and Mandelbaum, R., "Stereo-Based Vision System for Automotive Imminent Collision Detection," in Intelligent Vehicles Symposium, 2004 IEEE, 2004, IEEE, 274-279.

[5.] Guney, F. and Geiger, A., "Displets: Resolving Stereo Ambiguities Using Object Knowledge," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4165-4175, 2015.

[6.] Schwarze, T. and Lauer, M., "Geometry Estimation of Urban Street Canyons Using Stereo Vision from Egocentric View," in Informatics in Control, Automation and Robotics (Springer International Publishing, 2015), 279-292.

[7.] Liu, P., Wang, F., He, Y., Dong, H. et al., "Pose Estimation for Vehicles Based on Binocular Stereo Vision in Urban Traffic," in International Conference on Intelligent Computing, 2015, Springer International Publishing, 454-465.

[8.] Scaramuzza, D. and Fraundorfer, F., "Visual Odometry [Tutorial]," IEEE Robotics & Automation Magazine 18(4):80-92, 2011.

[9.] Hartley, R. and Zisserman, A., Multiple View Geometry in Computer Vision (Cambridge University Press, 2003).

[10.] Moravec, H.P., "Obstacle Avoidance and Navigation in the Real World by a Seeing Robot Rover," No. STAN-CS-80-813. Stanford University, California, Department of Computer Science, 1980.

[11.] Nister, D., Naroditsky, O., and Bergen, J., "Visual Odometry," in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004. Vol. 1, 2004, IEEE, I-652.

[12.] Aladem, M. and Rawashdeh, S., "Lightweight Visual Odometry for Autonomous Mobile Robots," Sensors 18(9):2837, 2018.

[13.] Rawashdeh, N.A. and Rawashdeh, S.A., "Effect of Structural Scene Content on Feature-Based Visual Odometry Performance," SAE Technical Paper 2018-01-0610, 2018, doi:10.4271/2018-01-0610.

[14.] Surgailis, T., Valinevicius, A., and Eidukas, D., "Stereo Vision Based Traffic Analysis System," Elektronika ir Elektrotechnika 107(1):15-18, 2015.

[15.] Choi, J., Lee, C., Eem, C., and Hong, H., "SLAM Method by Disparity Change and Partial Segmentation of Scene Structure," Journal of the Institute of Electronics and Information Engineers 52(8), 2015.

[16.] Vishnyakov, B.V., Vizilter, Y.V., Knyaz, V.A., Malin, I.K. et al., "Stereo Sequences Analysis for Dynamic Scene Understanding in a Driver Assistance System," in SPIE Optical Metrology, id. 95300P, 2015, International Society for Optics and Photonics.

[17.] Rawashdeh, N.A., Love, S.T., and Donohue, K.D., "Hierarchical Image Segmentation by Structural Content," Journal of Software 3(2):41, 2008.

[18.] Watson, A., "The Cortex Transform: Rapid Computation of Simulated Neural Images," Computer Vision Graphics and Image Processing 39:311-327, 1987, Academic Press.

[19.] Mair, E., Hager, G.D., Burschka, D., Suppa, M. et al., "Adaptive and Generic Corner Detection Based on the Accelerated Segment Test," in European Conference on Computer Vision, 2010, Springer Berlin Heidelberg, 183-196.

[20.] Calonder, M., Lepetit, V., Strecha, C., and Fua, P., "BRIEF: Binary Robust Independent Elementary Features," European Conference on Computer Vision 778-792, 2010, Springer Berlin Heidelberg.

[21.] Lowe, D.G., "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision 60(2):91-110, 2004.

[22.] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L., "Speeded-Up Robust Features (SURF)," Computer Vision and Image Understanding 110(3):346-359, 2008.

[23.] Rosten, E. and Drummond, T., "Machine Learning for High Speed Corner Detection," in European Conference on Computer Vision, 1, 2006.

[24.] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G., "ORB: An Efficient Alternative to SIFT or SURF," in 2011 IEEE International Conference on Computer Vision (ICCV), 2011, IEEE, 2564-2571.

[25.] Leutenegger, S., Chli, M., and Siegwart, R., "BRISK: Binary Robust Invariant Scalable Keypoints," Proceedings of the International Conference on Computer Vision 2548-2555, 2011.

[26.] Shi, J. and Tomasi, C., "Good Features to Track," Technical Report TR-93-1399, Cornell University, 1993.

[27.] Geiger, A., Lenz, P., and Urtasun, R., "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite," in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, IEEE, 3354-3361.

[28.] Badino, H., Yamamoto, A., and Kanade, T., "Visual Odometry by Multi-Frame Feature Integration," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, 222-229.

[29.] Brown, M., Szeliski, R., and Winder, S., "Multi-Image Matching Using Multi-Scale Oriented Patches," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, Vol. 1, June 2005, IEEE, 510-517.

Nathir A. Rawashdeh, German Jordanian University, Jordan

Mohamed Aladem, Stanley Baek, and Samir A. Rawashdeh, University of Michigan - Dearborn, USA


Received: 17 Jul 2018

Revised: 08 Feb 2019

Accepted: 04 Apr 2019

e-Available: 21 May 2019

TABLE 1 Performance of common feature extraction algorithms in the
three image content classes: Random, Texture, and Transient. Metrics
show mean and standard deviation of the number of tracked corners as a
percentage of detected corners in each class over a road driving video

Feature extractor   Random class (%)  Texture class (%)  Transient class (%)

AGAST + BRIEF       25.1 ± 13.6       24.7 ± 14.9        30.6 ± 14.8
BRISK               14.8 ± 8.4        15.4 ± 9.5         17.0 ± 8.5
ORB                 15.8 ± 7.7        16.4 ± 8.9         21.6 ± 10.4
SIFT                31.0 ± 12.7       30.5 ± 14.4        38.3 ± 12.3
SURF                32.6 ± 13.4       32.5 ± 14.7        40.5 ± 13.3
FAST + BRIEF        25.3 ± 13.7       24.6 ± 15.2        30.5 ± 15.0
Shi-Tomasi + BRIEF  21.2 ± 11.8       21.5 ± 13.1        27.9 ± 13.7

TABLE 2 Translation error of our VO with different feature extraction

Feature extractor   Et (%)

AGAST + BRIEF       1.44
BRISK               1.44
ORB                 1.80
SIFT                1.50
SURF                1.46
FAST + BRIEF        1.45
Shi-Tomasi + BRIEF  1.44

TABLE 3 Translation error Et (%) excluding different tile classes.

Preprocessing type        Et(%)

No exclusion              1.44
Random tiles excluded     1.56
Texture tiles excluded    1.61
Transient tiles excluded  1.41
COPYRIGHT 2018 SAE International
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2018 Gale, Cengage Learning. All rights reserved.

Author:Rawashdeh, Nathir A.; Aladem, Mohamed; Baek, Stanley; Rawashdeh, Samir A.
Publication:SAE International Journal of Passenger Cars - Electronic and Electrical Systems
Date:Aug 1, 2018