Camera-based Visual Inertial Navigation (VisNav) and Simultaneous
Localization and Mapping (SLAM) Error Reduction via Integration with
an Inertial Measurement Unit (IMU).
Michael Shawn Quinn
Graduate Student, Computer Science and Software Engineering
University of Washington, Bothell, WA
Abstract
The pose or position of a single camera in 3D can be used to monitor and track location
relative to a starting point, generating a map of a route travelled by a robot, autonomous
vehicle, or a human wearing an augmented reality headset. Camera-only tracking is
susceptible to loss of localization when the camera is moving at high speed, or cannot
clearly detect reference objects. An IMU does not require visible reference objects to
maintain localization, but an IMU does exhibit poor signal-to-noise ratio when moving at
low speed, the type of movement ideally suited to camera pose localization. A camera
and an IMU should ideally complement each other in tracking and localization
applications, and this paper presents data that demonstrates significant advantages of
combined camera-IMU VisNav/SLAM over camera-only or IMU-only implementations.
1.0 Introduction
The ability to navigate through an environment while avoiding obstacles and keeping a
record of the route travelled is an important capability for mobile robots, autonomous
vehicles, and human users wearing augmented reality headgear. Digital cameras provide
a view of the surrounding environment from which a significant amount of information
can be obtained. A monocular camera senses movement by tracking objects as they
appear in differing positions across a sequence of still images. Objects are identified by
locating key features, such as corners, and features are matched by the size, shape, and
intensity distribution of the image pixels comprising the feature. An important limitation
of this technique is that tracking stops or pauses when common features cannot be
identified between successive images. Moving at higher velocity further reduces the
likelihood of acceptable feature identification and tracking.
The term Visual Odometry, as discussed in [4], refers to the processing of images from a
camera or cameras to obtain sequences of camera translation measurements, that is,
incremental measurements of the distances translated by the camera itself. The
measurements are made with respect to the reference frame of the 3D objects in the visible
environment captured in the 2D images, through the use of perspective geometry and
triangulation [6], [13]. This type of odometry information can be used to keep track of
total distance traveled, much the way an automobile odometer does.
Visual Odometry data is a key component of systems performing Visual Navigation or
VisNav [10] and Simultaneous Localization and Mapping or SLAM [8]. VisNav not only
keeps track of incremental and total distance travelled, but also monitors changes in
direction and can implement obstacle avoidance. SLAM establishes a zero-reference
starting point for a sequence of moves and records the moves in a map of the route
travelled. An important element of some SLAM implementations is the ability to
recognize when returning to the starting point, or to acknowledge that an observed
feature has been seen before.
An inertial measurement unit (IMU) allows a robot, autonomous vehicle, or augmented
reality headset user to keep track of distance moved and direction changes by integrating
acceleration and angular velocity measurements over time. The technique is simple to
implement and the math behind it is straightforward. As with Visual Odometry,
incremental move translations are generated over time intervals. The sensor outputs are
susceptible to electrical noise, however, and sensor offsets and scale factor errors are
difficult to eliminate. The resulting signal-to-noise ratio improves when the sensors are
excited nearer their full-scale ratings; for this reason, an IMU works best when
navigating at higher velocity.
Camera-only VisNav and SLAM have been studied extensively in the recent past [12], as
has the performance of systems in which a camera assists an IMU. IMU-only
performance and direct comparisons of camera-only to IMU-only systems have,
however, received little attention. To fill that void, this study compares camera-only,
IMU-only, and combined camera-
IMU systems when tested over a common set of conditions. VisNav is defined in this
study as the ability to follow a path and avoid obstacles by means of vision-based object
identification and subsequent motion detection. SLAM is defined for this study as the
ability to start and maintain a record of a route traveled, by recording a sequence of
distance measurements relative to a zero reference starting point. It is hypothesized that
the camera-IMU system will perform better overall than either the camera-only system or
the IMU-only system.
2.0 Test Setup
2.1 Vision
A 2048 x 1536 pixel, color, USB camera, model number UI-1460SE-C-HQ,
manufactured by IDS Imaging, with 12 mm fixed focal length lens, Tamron No. 86761,
provides video input to both computer vision systems. Camera interface software was
implemented using the C++ application programming interface (API) from IDS. Image
processing software was developed using the OpenCV library version 3.1, for C++,
compiled in QtCreator.
2.2 IMU
A 3-axis IMU, model MotionNode, manufactured by Motion Workshop, Seattle, WA,
provides navigation sensing input to the IMU-only and IMU-assisted computer vision
systems. The IMU consists of 3 independent, orthogonal accelerometers, gyroscopes,
and magnetometers. Only the outputs from the accelerometers and gyroscopes are used
in this study. Sensor data is acquired via a C++ API.
2.3 Platform
The camera, IMU, and notebook computer are mounted on a rolling platform to perform
the tests. The camera and IMU are co-mounted on a rigid plate to ensure they experience
equivalent motion inputs.
2.4 Test Course
An indoor, concrete slab floor, marked off with tape, serves as the test track. Laminated
optical targets consisting of white backgrounds with alternating black squares in a 6 by 9
checkerboard pattern were mounted on the 4 surrounding walls to serve as computer
vision motion tracking targets. The targets are intentionally arranged so that there are
points on the course where the camera will temporarily lose track of the reference
objects. Visible timing markers were placed on the floor every 2 feet, allowing the
experimenter to log time stamps at regular intervals. The test course is shown in Figure
1.
Figure 1: Test course.
2.5 Computer
A Dell Inspiron notebook computer with an Intel Core i7 processor, running Windows 7
and Ubuntu Linux 14.04 (dual boot), served as the computing platform. Initial tests were
run on Windows 7. The main control and data acquisition program was written in C++
using Qt Creator (Qt version 5.5.1) and OpenCV version 3.1.
3.0 Data Gathering
Images used for motion detection were acquired at a rate of 15 frames per second from
the camera. Accelerometer and gyroscope data from the IMU was acquired over the
USB interface and stored in a delimited text file on the computer.
4.0 Algorithms
4.1 Vision
Before commencing the study, the camera is calibrated using the method in [1] and
the OpenCV Calibration module. The resulting calibration parameters are used to
correct each acquired image, removing lens distortion and correcting for non-
idealities in the camera sensor.
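As a minimal sketch of this calibration step, assuming the same 6 by 9 checkerboard
targets described in Section 2.4 are reused (the inner-corner count, function name, and
variable names are illustrative, not the study's exact values):

    #include <opencv2/calib3d.hpp>
    #include <vector>

    // Hypothetical calibration sketch: accumulate checkerboard corner
    // detections over several views, then solve for the camera matrix and
    // lens distortion coefficients with cv::calibrateCamera.
    void calibrateFromViews(const std::vector<cv::Mat>& views,
                            cv::Mat& cameraMatrix, cv::Mat& distCoeffs)
    {
        const cv::Size boardSize(6, 9);  // inner corners per row/column (assumed)
        std::vector<std::vector<cv::Point2f>> imagePoints;
        std::vector<std::vector<cv::Point3f>> objectPoints;

        // Ideal corner locations on the flat target, in units of one square.
        std::vector<cv::Point3f> corners3d;
        for (int r = 0; r < boardSize.height; ++r)
            for (int c = 0; c < boardSize.width; ++c)
                corners3d.emplace_back((float)c, (float)r, 0.0f);

        for (const cv::Mat& view : views) {
            std::vector<cv::Point2f> corners;
            if (cv::findChessboardCorners(view, boardSize, corners)) {
                imagePoints.push_back(corners);
                objectPoints.push_back(corners3d);
            }
        }

        std::vector<cv::Mat> rvecs, tvecs;  // per-view poses, not needed here
        cv::calibrateCamera(objectPoints, imagePoints, views[0].size(),
                            cameraMatrix, distCoeffs, rvecs, tvecs);
    }

Each subsequent frame can then be corrected with cv::undistort(raw, corrected,
cameraMatrix, distCoeffs) before feature detection.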
The basic steps in Visual Odometry, also commonly used in studies of structure-
from-motion, are discussed in [6] and are summarized here, with applicable
OpenCV modules identified where used:
1. Obtain the first image.
2. Move the camera.
3. Obtain the second image.
4. Detect objects (OpenCV optical flow).
5. Track objects (OpenCV optical flow).
6. Calculate the fundamental and essential matrices (OpenCV 3D Reconstruction).
7. Use singular value decomposition (SVD) to obtain the camera matrices.
8. Calculate the rotation and translation matrices required to move from the first
image to the second image (OpenCV 3D Reconstruction).
9. From the resulting matrices, extract position data as 3D point coordinates.
10. Repeat for the next image pair.
Basically, we treat the first image as a reference point and calculate how far the camera
has moved based on the transformation required to convert the points in the first image
into the points in the second image. We treat the image pair as though it were a stereo
pair taken concurrently, rather than two separate images taken at different times and
different locations. We rotate the second image and adjust its height to allow horizontal
point correlation along epipolar lines, which is the basis of the optical flow algorithms.
Once we have a move value, we append it to our move sequence to continue
constructing our map for SLAM. A condensed code sketch of one iteration follows.
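The sketch below shows one iteration of this loop using OpenCV 3.x calls from the
modules named in the step list; cv::recoverPose performs the SVD-based decomposition
of the essential matrix internally, and the parameter values and names are illustrative
rather than the study's exact settings:

    #include <opencv2/calib3d.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/video/tracking.hpp>
    #include <vector>

    // One Visual Odometry iteration: track features from the previous frame
    // into the current one, then recover the incremental camera motion.
    // K is the calibrated camera matrix; R and t receive the rotation and
    // translation from the first image to the second.
    bool visualOdometryStep(const cv::Mat& prevGray, const cv::Mat& currGray,
                            const cv::Mat& K, cv::Mat& R, cv::Mat& t)
    {
        // Detect corner features in the first image.
        std::vector<cv::Point2f> prevPts, currPts;
        cv::goodFeaturesToTrack(prevGray, prevPts, 500, 0.01, 10.0);
        if (prevPts.size() < 8) return false;  // need 8+ correspondences

        // Track the features into the second image with pyramidal
        // Lucas-Kanade optical flow.
        std::vector<uchar> status;
        std::vector<float> err;
        cv::calcOpticalFlowPyrLK(prevGray, currGray, prevPts, currPts,
                                 status, err);

        // Keep only the successfully tracked point pairs.
        std::vector<cv::Point2f> p0, p1;
        for (size_t i = 0; i < status.size(); ++i)
            if (status[i]) { p0.push_back(prevPts[i]); p1.push_back(currPts[i]); }
        if (p0.size() < 8) return false;

        // Essential matrix from the correspondences, then decompose it
        // (SVD inside recoverPose) into relative rotation and translation.
        cv::Mat mask;
        cv::Mat E = cv::findEssentialMat(p0, p1, K, cv::RANSAC, 0.999, 1.0, mask);
        if (E.empty()) return false;
        return cv::recoverPose(E, p0, p1, K, R, t, mask) > 0;
    }

Note that with a single camera the translation is recovered only up to an unknown scale
factor; a known reference, such as the spacing of the floor markers or the target
geometry, is needed to express the moves in metric units.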
4.2 IMU
Linear velocity is obtained by integrating the acceleration values from the accelerometer,
and linear displacement is found by integrating the calculated linear velocity values.
Angular acceleration is obtained by differentiating the angular velocity output of the
gyroscope, and angular position is obtained by integrating the angular velocity values [9].
For this study, it is assumed that the course is sufficiently level to allow treating the
downward acceleration due to gravity as a constant. The true acceleration values are
found by performing vector subtraction of this constant value from the recorded
accelerometer output.
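A minimal dead-reckoning sketch of these integration steps follows, assuming the
accelerometer axes are aligned with the level course so that gravity appears only on the
vertical axis; the sample structure and names are illustrative, not the MotionNode API:

    // Hypothetical IMU sample: accelerations in m/s^2 and gyroscope
    // angular velocity about the vertical axis in rad/s.
    struct ImuSample { double ax, ay, az, gz; };

    // Accumulated navigation state: velocity, position, and heading.
    struct ImuState { double vx = 0, vy = 0, x = 0, y = 0, heading = 0; };

    // Integrate one sample over time step dt: acceleration -> velocity,
    // velocity -> displacement, angular velocity -> heading angle.
    void imuStep(ImuState& s, const ImuSample& m, double dt)
    {
        const double g = 9.81;      // constant-gravity assumption (level course)
        double azTrue = m.az - g;   // vector subtraction of gravity
        (void)azTrue;               // vertical motion is ignored on a level course

        s.vx += m.ax * dt;          // first integration: velocity
        s.vy += m.ay * dt;
        s.x  += s.vx * dt;          // second integration: displacement
        s.y  += s.vy * dt;
        s.heading += m.gz * dt;     // integrate gyro rate to obtain heading
    }

Any uncorrected accelerometer offset passes through this double integration and grows
with time, which is consistent with the static shift seen in the IMU-only results in
Section 7.0.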
5.0 Test Protocol
Simultaneously start the timer and the image and data acquisition, and begin
walking around the test route at approximately 2 m/s.
Note progress by pressing the space bar on the notebook computer keyboard when
passing the interval markers on the floor.
6.0 Data Analysis and Statistical Methods
Descriptive measures, including the mean, variance, and standard deviation of measured
and observed values, will be used to depict the individual performance of each
system. Inferences from examining the difference between means, together with
statistical power calculations, will be used to test the hypothesis that the IMU-assisted
system will perform better than the camera-only system.
Statistical analysis is performed using Microsoft Excel 2013.
Calculate the deviation of measured incremental move translation from nominal
incremental move translation, at each time interval for the camera-only data set for
each speed setting.
Calculate the deviation of measured incremental move translation from nominal
incremental move translation, at each time interval for the IMU-only data set for
each speed setting.
Construct camera-IMU data sets for each speed setting by calculating the average
of the camera-only and IMU-only measured incremental move translation at each
time interval.
Calculate the deviation of camera-IMU incremental move translation from
nominal incremental move translation, at each time interval, for each speed
setting.
At a 95% confidence level (significance level of .05), perform a z-test of the null
hypothesis that there is no difference between the mean error of the camera-only
data set and that of the camera-IMU data set.
At the same confidence level, perform a z-test of the null hypothesis that there is
no difference between the mean error of the IMU-only data set and that of the
camera-IMU data set. (A minimal computation of this test statistic is sketched below.)
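The test statistic itself reduces to a few lines; the following sketch mirrors the Excel
two-sample z-test for means with known variances (the function name is illustrative):

    #include <cmath>
    #include <numeric>
    #include <vector>

    // Two-sample z statistic for means with known variances:
    // z = (mean1 - mean2 - d0) / sqrt(var1/n1 + var2/n2), with d0 = 0 here.
    double twoSampleZ(const std::vector<double>& a, double varA,
                      const std::vector<double>& b, double varB)
    {
        double meanA = std::accumulate(a.begin(), a.end(), 0.0) / a.size();
        double meanB = std::accumulate(b.begin(), b.end(), 0.0) / b.size();
        return (meanA - meanB) / std::sqrt(varA / a.size() + varB / b.size());
    }

At the 95% confidence level, the null hypothesis is rejected when |z| exceeds the
two-tailed critical value of 1.96.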
7.0 Results
The course was traversed and the data was recorded. A total of 37 data points was
captured for both the camera and the IMU, yielding 36 incremental move measurements.
The data was loaded into an Excel workbook for analysis and chart generation. The
combined camera-IMU data was calculated as the average of the camera-only and
IMU-only outputs at each measurement point.
The maps were generated in an Excel XY scatter plot by calculating the heading at each
point, where:
X = displacement magnitude × cos(heading angle)
Y = displacement magnitude × sin(heading angle)
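A short sketch of this map construction follows, assuming each incremental
displacement is projected onto X and Y at the current heading and accumulated from the
zero-reference starting point (structure and names are illustrative):

    #include <cmath>
    #include <vector>

    struct Point2D { double x, y; };

    // Accumulate incremental move translations into map coordinates:
    // each displacement magnitude is resolved into X and Y components
    // using the heading angle at that interval.
    std::vector<Point2D> buildMap(const std::vector<double>& displacements,
                                  const std::vector<double>& headingsRad)
    {
        std::vector<Point2D> path{{0.0, 0.0}};  // zero-reference start
        for (size_t i = 0; i < displacements.size(); ++i) {
            Point2D p = path.back();
            p.x += displacements[i] * std::cos(headingsRad[i]);
            p.y += displacements[i] * std::sin(headingsRad[i]);
            path.push_back(p);
        }
        return path;
    }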
The mapping results are shown in Figure 2.
Figure 2: Mapping results (X vs. Y) for the Nominal, Camera Only, IMU Only, and
Camera-IMU data.
As can be seen in Figure 2, both the camera and the IMU failed to track the nominal path
as well as the combined camera-IMU data did, with the camera data deviating near
corners and the IMU data shifted away from the nominal straight-line segments.
To determine the effectiveness of combining the two sensor outputs into a single data
value, we use a z-test. Excel has a built-in function for calculating the z-test of two
means. The test was conducted to compare the camera-only to the camera-IMU data,
and the IMU-only to the camera-IMU data, based on the average error at each
measurement interval for each of the three data sets.
The comparison of the camera-only data to the camera-IMU data resulted in:
z-Test: Two Sample for Means, Camera-Only to Camera-IMU

                                Camera-Only     Camera-IMU
Mean                            0.049556219     -1.08633682
Known Variance                  0.651           0.445
Observations                    36              36
Hypothesized Mean Difference    0
z                               6.510036318
P(Z<=z) one-tail                3.75663E-11
z Critical one-tail             1.644853627
P(Z<=z) two-tail                7.51326E-11
z Critical two-tail             1.959963985
Since the two-tailed p value was less than .05, we reject the null hypothesis that there is
no difference between the two means. Similarly, for the IMU-only to camera-IMU test:
z-Test: Two Sample for Means, IMU-Only to Camera-IMU

                                IMU-Only        Camera-IMU
Mean                            -2.155781176    -1.08633682
Known Variance                  0.66854         0.445
Observations                    36              36
Hypothesized Mean Difference    0
z                               -6.080741366
P(Z<=z) one-tail                5.98141E-10
z Critical one-tail             1.644853627
P(Z<=z) two-tail                1.19628E-09
z Critical two-tail             1.959963985
Since the two-tailed p value was again less than .05, we also reject the null hypothesis
that there is no difference between these two means.
8.0 Conclusions and Study Limitations
The camera performed poorly when turning corners, as the detection algorithm was
unable to maintain a lock on reference objects and computed incorrect heading angles.
The IMU performed equally well on straight segments and when rounding corners, but
was subject to a static offset error and measurement noise that shifted its output away
from the nominal path.
Had time permitted, the study would have been run at several different speeds to more
effectively characterize the tradeoffs between the two sensor types. A more precise
method of noting distance markers and of recording time intervals would improve the
repeatability of the study. The study would also benefit from a more robust method of
fusing the camera and IMU outputs, ideally one that implements an error-optimization
algorithm such as a Kalman filter.
Even with the aforementioned limitations, the combination of the camera and the IMU
was clearly superior to either sensor working alone, and would likely perform at an even
greater advantage with the improvements described above.
9.0 References
[1] Bradski, G. R., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the
OpenCV library (1st ed.). Sebastopol, CA: O'Reilly.
[2] The OpenCV website, www.opencv.org.
[3] Siegwart, R., & Nourbakhsh, Illah R. (2011). Introduction to Autonomous Mobile Robots
(2nd Edition). MIT Press.
[4] Nister, D., Naroditsky, O., & Bergen, J. (2004). Visual Odometry. Proceedings of the
2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR 2004), 1, I.
[5] Davison, A. J. (2003). Real-time simultaneous localisation and mapping with a single
camera. Proceedings of the Ninth IEEE International Conference on Computer Vision,
1403-1410.
[6] Scaramuzza, D., & Fraundorfer, F. (2011). Visual Odometry: Part I [Tutorial]. Robotics
& Automation Magazine, IEEE, 18(4), 80-92.
[7] Fraundorfer, F., & Scaramuzza, D. (2012). Visual Odometry: Part II [Tutorial]. Robotics
& Automation Magazine, IEEE, 19(2), 78-90.
[8] Rothganger, F., & Muguira, M. (2007). SLAM using camera and IMU sensors.
[9] Seifert K., & Camacho O. (2007). Implementing Positioning Algorithms Using
Accelerometers, AN3397, Freescale Semiconductor.
[10] Troiani, C., Martinelli, A., Laugier, C., & Scaramuzza, D. (2015). Low computational-
complexity algorithms for vision-aided inertial navigation of micro aerial vehicles.
Robotics And Autonomous Systems, 69, 80-97.
[11] Frese, U. (2006). A Discussion of Simultaneous Localization and Mapping. Autonomous
Robots, 20(1), 25-42.
[12] Davison, A. J. (2003). Real-time simultaneous localisation and mapping with a single
camera. Proceedings of the IEEE International Conference on Computer Vision, 2, 1403-
1410.
[13] Emami, S., & Khvedchenia, I. (2012). Mastering OpenCV with Practical
Computer Vision Projects. Birmingham: Packt Publishing.