Introduction to Rust-CV

Rust CV is a project that aims to bring Computer Vision (CV) algorithms to the Rust programming language. To follow this tutorial, a basic understanding of Rust and its ecosystem is required. Some computer vision knowledge also helps.

About this book

This tutorial aims to help you understand how to use Rust-CV and what it has to offer. In its current state, this tutorial is rather incomplete and a lot of examples are missing. If you spot an error or wish to add a page, feel free to open a PR. We are very open, so don't hesitate to contribute.

The code examples in this book can be found here. The source for the book can be found here.

Project structure

Before using Rust-CV, it is important to understand how the ecosystem is set up and how to use it.

If you look at the repository, you can see multiple directories and a Cargo.toml file that contains a [workspace] section. As the name implies, we are using the workspace feature of Cargo. To put it simply, a Cargo workspace allows us to create multiple packages in the same repository. You can check the official documentation for more details.

This is what happens here. Rust-CV is a mono-repository project made of multiple packages. For instance, the cv-core directory contains the very basic building blocks for computer vision, and the imgshow directory allows us to show images. There are many more, but we won't go deeper for now. The cv crate does need to be explained, though. It is rather empty (code-wise) and just re-exports most of the other packages, so by depending on it alone we get most of what the project has to offer.

There are three things to remember here:

  • The Rust-CV repository is a mono-repository that is split up into multiple packages.
  • The way the project is structured allows you to depend on tiny crates, so you don't have to pay the price for all the CV code if you only need a subset of it.
  • The project defines a crate named cv that depends on many of the others just to re-export them. This is useful to get started faster: by pulling in this one crate, you already have many computer vision algorithms and data structures ready to use, as in the sketch below.
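
As an illustration, here is a minimal sketch of what depending only on the cv facade crate might look like. The exact re-export path is an assumption and may differ between versions, so check the crate documentation.

// A minimal sketch, assuming the facade re-exports AKAZE under this path
// (the exact module path may differ between versions).
use cv::feature::akaze::Akaze;

fn main() {
    // The same type could also be reached by depending on the akaze crate directly;
    // the cv crate just saves you from listing every sub-crate in Cargo.toml.
    let _akaze = Akaze::default();
}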

First program

In this chapter, we will be reviewing our first program. It is not even really related to computer vision, but it will help you get started with some basic tools that will be used again later and get familiar with the tutorial process. We will just be loading an image, drawing random points on it, and displaying it. This is very basic, but at least you can make sure you have everything set up correctly before attacking more ambitious problems.

Running the program

In order to start, we will clone the Rust CV mono-repo. Change into the repository directory. Run the following:

cargo run --release --bin chapter2-first-program

You should see a grayscale image (from the Kitti dataset) with thousands of small blue translucent crosses drawn on it. If you see this, then everything is working on your computer. Here is what it should look like:

Random points

We are now going to go through the code in tutorial-code/chapter2-first-program/src/main.rs piece-by-piece. We will do this for each chapter of the book and its relevant example. It is recommended to tweak the code in every chapter to get a better idea of how the code works. All of the code samples can be found in tutorial-code. We will skip talking about the imports unless they are relevant at any point. Comments will also be omitted since we will be discussing the code in this tutorial.

The code

Load the image


#![allow(unused)]
fn main() {
    let src_image = image::open("res/0000000000.png").expect("failed to open image file");
}

This code will load an image called res/0000000000.png relative to the current directory you are running this program from. This will only work when you are in the root of the Rust CV mono-repo, which is where the res directory is located.
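
If you would like a clearer error when the program is started from the wrong directory, a small guard like the following sketch can help; it is not part of the chapter's code.

use std::path::Path;

fn main() {
    let path = "res/0000000000.png";
    // Hypothetical guard, not part of the chapter's code: check that the file is
    // visible from the current working directory before handing it to image::open.
    if !Path::new(path).exists() {
        eprintln!("{path} not found; run this from the root of the Rust CV mono-repo");
        std::process::exit(1);
    }
    let _src_image = image::open(path).expect("failed to open image file");
}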

Create a random number generator


#![allow(unused)]
fn main() {
    let mut rng = rand::thread_rng();
}

This code creates a random number generator (RNG) using the rand crate. This random number generator will use entropy information from the OS to seed a fairly robust PRNG on a regular basis, and it is used here because it is very simple to create one.
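
If you ever want the crosses to land in the same places on every run, for example while tweaking the drawing code, you could swap in a seeded PRNG instead; this is only an optional variation, not what the chapter's code does.

use rand::{rngs::StdRng, Rng, SeedableRng};

fn main() {
    // A fixed seed makes the "random" positions reproducible between runs.
    let mut rng = StdRng::seed_from_u64(42);
    // Arbitrary bounds standing in for the image dimensions.
    let x: u32 = rng.gen_range(0..1000);
    let y: u32 = rng.gen_range(0..500);
    println!("first cross at ({x}, {y})");
}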

Drawing the crosses


#![allow(unused)]
fn main() {
    let mut image_canvas = drawing::Blend(src_image.to_rgba8());
    for _ in 0..5000 {
        let x = rng.gen_range(0..src_image.width()) as i32;
        let y = rng.gen_range(0..src_image.height()) as i32;
        drawing::draw_cross_mut(&mut image_canvas, Rgba([0, 255, 255, 128]), x, y);
    }
}

This section of code iterates 5000 times. Each iteration, it generates a random x and y position that lies on the image. We then use the imageproc::drawing module to draw a cross at the spot in question. Note that image_canvas is created by making an RGBA version of the original grayscale image and then wrapping it in the imageproc::drawing::Blend adapter. This is done so that when we draw the cross onto the canvas, it uses the alpha value (which we set to 128) to make the cross translucent. That lets us see through the cross a little bit, so it doesn't totally obscure the underlying image.

Changing the image back into DynamicImage


#![allow(unused)]
fn main() {
    let out_image = DynamicImage::ImageRgba8(image_canvas.0);
}

We now take the RgbaImage and turn it into a DynamicImage. This is done because DynamicImage is a wrapper around all image types that has convenient save and load methods, and we actually used it when we originally loaded the image.

Write the image to a temporary file


#![allow(unused)]
fn main() {
    let image_file_path = tempfile::Builder::new()
        .suffix(".png")
        .tempfile()
        .unwrap()
        .into_temp_path();
    out_image.save(&image_file_path).unwrap();
}

Here we use the tempfile crate to create a temporary file. The benefit of a temporary file is that it can be deleted automatically for us when we are done with it. In this case it may not get deleted automatically, because the OS image viewer that is used later may prevent the file from being deleted, but it is still good practice to use temporary files to store images like this.

After we create the temporary file path, we write to the path by saving the output image to it.

Open the image


#![allow(unused)]
fn main() {
    open::that(&image_file_path).unwrap();
    std::thread::sleep(std::time::Duration::from_secs(5));
}

We use the open crate here to open the image file. This will automatically use the program configured on your computer for viewing images. Since the image viewer may not open the file right away, we sleep for a few seconds to make sure we don't delete the temporary file before it has been opened.

End

This is the end of this chapter.

Feature extraction

In this chapter, we will be writing our second Rust-CV program. Our goal will be to run the AKAZE extractor and display the result.

What is an image feature?

Features are comprised of a location in an image (called a "keypoint") and some data that helps characterize visual information about the feature (called a "descriptor"). We typically try to find features on images which can be matched to each other most easily. We want each feature to be visually discriminative and distinct so that we can find similar-looking features in other images without getting false-positives.

For the purposes of Multiple-View Geometry (MVG), which encompasses Structure from Motion (SfM) and visual Simultaneous Localization and Mapping (vSLAM), we typically wish to erase certain information from the descriptor of the feature. Specifically, we want features to match so long as they correspond to the same "landmark". A landmark is a visually distinct 3d point. Since the position and orientation (known as "pose") of the camera will be different in different images, the symmetric perspective distortion, in-plane rotation, lighting, and scale of the feature might be different between different frames. For simplicity, all of these terms mean that a feature looks different when you look at it from different perspectives and at different ISO, exposure, or lighting conditions. Due to this, we typically want to erase this variable information as much as possible so that our algorithms see two features of the same landmark as the same despite these differences.

If you want a more precise and accurate description of features, reading the OpenCV documentation about features, or the Wikipedia pages on features and descriptors, is recommended.

What is AKAZE and how does it work?

In this tutorial, we use the feature extraction algorithm AKAZE. AKAZE attempts to be invariant to rotation, lighting, and scale. It does not attempt to be invariant to skew, which can happen when we are looking at a surface from a large angle of incidence. AKAZE occupies a "sweet spot" among feature extraction algorithms. For one, AKAZE feature detection can pick up the interiors of corners, corners, points, and blobs. AKAZE's feature matching is very nearly as robust as SIFT, the gold standard for robustness in this class of algorithms, and it often exceeds the performance of SIFT. In terms of speed, AKAZE is significantly faster than SIFT to compute, and it can be sped up substantially with the use of GPUs or even FPGAs. However, the other gold standard in this class of algorithms is ORB (and the closely related FREAK, which can perform better than ORB). ORB targets speed, and it is roughly 5x faster than AKAZE, although this is implementation-dependent. One downside is that ORB is significantly less robust than AKAZE. Due to these considerations, AKAZE can, given the correct algorithms and computing power, meet the needs of real-time operation while also producing high quality output.

AKAZE picks up features by looking at an approximation of the determinant of the Hessian over the luminosity to perform a second partial derivative test. This determinant is called the response in this context. Any local maximum in the response greater than a threshold is detected as a keypoint, and sub-pixel positioning is performed to extract the precise location of the keypoint. The threshold is typically above 0 and less than or equal to 0.01. This is done at several different scales, and at each of those scales the image is selectively blurred and occasionally shrunk. In the process of extracting the determinant of the Hessian, it extracts first and second order gradients of the luminosity across X and Y in the image. Using the scale and rotation of the feature, we determine a scale and orientation at which to sample the descriptor. The descriptor is extracted by making a series of binary comparisons on the luminosity, the first order gradients of the luminosity, and the second order gradients of the luminosity. Each binary comparison results in a single bit, and that bit is literally stored as a bit on the computer. In total, 486 comparisons are performed, thus AKAZE has a 486-bit "binary feature descriptor". For convenience, this can be padded with zeros to become 512-bit.
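
To make these bit counts concrete, here is a small arithmetic sketch of a 486-bit descriptor padded to 512 bits; the [u64; 8] representation is only an illustration, not necessarily the storage type Rust-CV uses internally.

fn main() {
    // 512 bits = 8 words of 64 bits; 486 of them carry comparison results and
    // the remaining 26 are zero padding.
    let descriptor: [u64; 8] = [0; 8];
    let stored_bits = 64 * descriptor.len();
    let meaningful_bits = 486;
    println!(
        "{stored_bits} stored bits, {meaningful_bits} meaningful, {} padding",
        stored_bits - meaningful_bits
    );
}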

Running the program

Make sure you are in the Rust CV mono-repo and run:

cargo run --release --bin chapter3-akaze-feature-extraction

If all went well, you should see a window showing this:

Akaze result

The code

Open the image


#![allow(unused)]
fn main() {
    let src_image = image::open("res/0000000000.png").expect("failed to open image file");
}

We saw this in chapter 2. This will open the image. Make sure you run this from the correct location.

Create an AKAZE feature extractor


#![allow(unused)]
fn main() {
    let akaze = Akaze::default();
}

For the purposes of this tutorial, we will just use the default AKAZE settings. You can change the settings at this point by replacing Akaze::default() with Akaze::sparse() or Akaze::dense(). It has other settings you can modify as well.
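
For instance, swapping between the presets mentioned above is just a matter of calling a different constructor, as in the sketch below; the import path is an assumption and may differ between versions.

// Assumed import; the chapter's code brings Akaze into scope similarly,
// though the exact path may differ between versions.
use cv::feature::akaze::Akaze;

fn main() {
    // Default detection threshold.
    let _akaze = Akaze::default();
    // Sparser output: fewer keypoints are detected.
    let _sparse = Akaze::sparse();
    // Denser output: more keypoints are detected.
    let _dense = Akaze::dense();
}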

Extract the AKAZE features


#![allow(unused)]
fn main() {
    let (key_points, _descriptors) = akaze.extract(&src_image);
}

This line extracts the features from the image. In this case, we will not be using the descriptor data, so the descriptors are discarded with the _.

Draw crosses and show image


#![allow(unused)]
fn main() {
    for KeyPoint { point: (x, y), .. } in key_points {
        drawing::draw_cross_mut(
            &mut image_canvas,
            Rgba([0, 255, 255, 128]),
            x as i32,
            y as i32,
        );
    }
}

Almost all of the rest of the code is the same as what we saw in chapter 2. However, the above snippet is slightly different. Rather than randomly generating points, we are now using the X and Y components of the keypoints AKAZE extracted. The output image actually shows the keypoints of the features AKAZE found.

End

This is the end of this chapter.

Feature matching

In this chapter, we will be matching AKAZE features and displaying the matches.

How do we perform feature matching?

Binary Features

As explained in the previous chapter, AKAZE's feature descriptors are comprised of bits. Each bit describes one comparison made between two points around the feature. These descriptors are generally designed so that similar features have similar descriptors. In this case, that means that similar features will have similar bit patterns. For AKAZE features, the number of bits that are different between two descriptors is the distance between them, and the number of bits in common is the similarity between them.

The primary reason that we use binary features is because computing the distance between two features is blazingly fast on modern computers. This allows us to get real-time performance. The amount of time that it takes to compare two binary features in practice is typically a few nanoseconds. Compare this to computing the distance between high dimensional (e.g., 128 dimensional for SIFT) floating point vectors, which typically takes much longer, even with modern SIMD instruction sets or the use of GPUs. How is this achieved?

How is the distance computed?

The way that the distance is computed can be summed up in two operations. The first operation is XOR. We simply take the XOR of the two 486-bit feature descriptors. The result has a 1 in every bit which is different between the two, and a 0 in every bit which is the same. If you then count the number of 1s in the output of the XOR, that tells you the distance between the two descriptors. This distance is called the Hamming distance.

For the next step, we just need to count the number of 1s in the result, so how do we do that? Rust actually has a built-in method on all primitive integers to do this, called count_ones.
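
Putting the two operations together, here is a minimal sketch of the distance computation over a 512-bit descriptor stored as eight u64 words; real implementations typically rely on dedicated crates or SIMD, as discussed below.

fn main() {
    // Hamming distance between two 512-bit descriptors stored as [u64; 8]:
    // XOR each pair of words, then count the 1 bits in each XOR result.
    fn hamming_distance(a: &[u64; 8], b: &[u64; 8]) -> u32 {
        a.iter()
            .zip(b.iter())
            .map(|(x, y)| (x ^ y).count_ones())
            .sum()
    }

    let a = [0u64; 8];
    let b = [0xFFu64; 8];
    // Each of the 8 words differs in 8 bits, so the distance is 64.
    assert_eq!(hamming_distance(&a, &b), 64);
    println!("distance = {}", hamming_distance(&a, &b));
}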

How is this so fast?

Modern Intel CPUs have an instruction called POPCNT. It counts the number of 1 bits in a 16-bit, 32-bit, or 64-bit number. This is actually incredibly fast, but it is possible to go faster.

To go faster, you can use SIMD instructions. Unfortunately, Intel has never introduced a SIMD POPCNT instruction, nor has ARM, so we have to do this manually. Luckily, SIMD instruction sets have so many ways to accomplish things that we can hack our way to success. A paper was written on this topic called "Faster Population Counts Using AVX2 Instructions", but the original inspiration for this algorithm can be found here. The core idea of these algorithms is to use the PSHUFB instruction, added in SSSE3. This instruction lets us perform a vertical table look-up. In SIMD parlance, a vertical operation is one that performs the same operation on several segments of a SIMD register in parallel. PSHUFB lets us perform a 16-way lookup, and a 16-way lookup requires 4 input bits. The table we use is the population count of each 4-bit value. This means that 0b1000 maps to 1, 0b0010 maps to 1, 0b1110 maps to 3, and 0b1111 maps to 4. From there, we need to effectively perform a horizontal add to compute the output. In SIMD parlance, a horizontal operation is one which reduces (as in map-reduce) information within a SIMD register, and will typically reduce the dimensionality of the register. In this case, we want to add all of the 4-bit sums into one large sum. The method by which this is achieved is tedious, and the paper mentioned above complicates matters further by operating on 16 separate 256-bit vectors at a time to increase speed.
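
The SIMD details are beyond this tutorial, but the 16-entry table idea can be illustrated in plain scalar Rust: split each byte into two 4-bit halves and look each half up in a table of precomputed population counts. This is only a didactic sketch; the vectorized versions apply the same lookup to many bytes at once.

fn main() {
    // Population count of every possible 4-bit value, the same table the
    // PSHUFB-based methods load into a SIMD register.
    const NIBBLE_POPCOUNT: [u8; 16] = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];

    fn popcount_byte(byte: u8) -> u8 {
        NIBBLE_POPCOUNT[(byte & 0x0F) as usize] + NIBBLE_POPCOUNT[(byte >> 4) as usize]
    }

    // 0b1011_0110 has five 1 bits, matching the built-in count_ones.
    assert_eq!(popcount_byte(0b1011_0110), 5);
    assert_eq!(popcount_byte(0b1011_0110), (0b1011_0110u8).count_ones() as u8);
}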

The lesson here is that SIMD can help us speed up various operations when we need to operate on lots of data in vectors. You should rely on libraries to take care of this, or let the compiler optimize it by using fixed-size bit arrays where possible, although it won't be perfect. If you absolutely must squeeze out that last bit of speed because the problem demands it and you know your problem domain and preconditions well, then the rabbit hole goes deep, so feel free to explore the literature on population count. The XOR operation doesn't really get more complicated though; it's just an XOR, and everything supports XOR (that I know of).

Tricks often used when matching

Lowe's ratio test

Matching features has a few more tricks that are important to keep in mind. Notably, one of the most well-known tricks is the Lowe's ratio test, named after its creator David Lowe, who mentions it offhandedly in the paper "Distinctive Image Features from Scale-Invariant Keypoints".

It is a recurring theme in computer vision that algorithms get named after the authors because they do not name algorithms which turn out to be incredibly important and fundamental. However, if we named every such algorithm that Richard Hartley created after himself, we would probably call this field Hartley Vision instead. Moving on...

The Lowe's ratio test is very simple. When we compute the distance between a specific feature and several other features, some of those other features will have a smaller distance to this specific feature than others. In this case, we only care about the two best matching features, often called the two "nearest neighbors" and abbreviated as 2-NN (k-NN in the general case). We simply want to see what the ratio is between the best distance and the second best distance. If this ratio is very small, that means the feature you matched to is much better than the second best alternative, so it is more likely to be correct. Therefore, we just test whether the ratio is below some threshold, which is typically set to about 0.7.
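
In code, the ratio test boils down to a single comparison between the two nearest-neighbor distances. The sketch below assumes you already have those two distances as floating point values.

fn main() {
    // Lowe's ratio test: keep the match only if the best distance is clearly
    // smaller than the second best distance.
    fn passes_ratio_test(best: f32, second_best: f32, ratio: f32) -> bool {
        best < ratio * second_best
    }

    // With the commonly used threshold of 0.7, this match is accepted ...
    assert!(passes_ratio_test(10.0, 20.0, 0.7));
    // ... and this ambiguous one is rejected.
    assert!(!passes_ratio_test(18.0, 20.0, 0.7));
}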

The Lowe's ratio doesn't make as much sense for binary features. Consider that the distance between two independent features should be roughly 128 bits, and it should follow a binomial distribution. Additionally, the Lowe's ratio is typically applied to floating point vectors in Euclidean space (L2 space), while binary features exist in Hamming space. Therefore, with binary features, it may suffice to check that the best feature's distance is a certain number of bits better than the second best match. You can apply both the Lowe's ratio test and this binary Lowe's test to more than just the first and second best matches specifically. It can also be applied to the best matches of the first and second best groups of matches to determine whether one group matches a feature better than another group.
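
The binary variant can be expressed as a minimum gap in bits between the best and second best Hamming distances. The sketch below is only an illustration; the exact comparison used by this chapter's matching code may differ, though the 48-bit value matches the threshold mentioned later in this chapter.

fn main() {
    // Binary variant of the Lowe's test: require the best match to be at least
    // min_gap bits closer than the second best match.
    fn passes_distance_test(best: u32, second_best: u32, min_gap: u32) -> bool {
        best + min_gap <= second_best
    }

    // With a gap of 48 bits, this confident match is accepted ...
    assert!(passes_distance_test(20, 80, 48));
    // ... and this ambiguous one is rejected.
    assert!(!passes_distance_test(20, 60, 48));
}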

The Lowe's ratio test helps us filter outliers and only find discriminative matches.

Symmetric matching

Another important thing to do when matching is to match symmetrically. Typically, when you match features, you have two groups of features: the new ones and the ones you already have. With symmetric matching, you are assuming that no feature appears twice within the new group or within the group you already have. This can also be done in addition to the Lowe's ratio test. Conceptually, you can understand symmetric matching in the following way.

If you find the matches from the new group to the old group, that gives you a best match in the old group for each item in the new group.

If you find the matches from the old group to the new group, that gives you a best match in the new group for each item in the old group.

If the match in the new group and the match in the old group are only each other, then they are symmetric.

Another way to try and understand symmetric matching is with a visual aid. Consider a 1d line. Three features (X, Y, and Z) are in the following line: X-----Y-Z. Y is much closer to Z than to X. The closest match to X is Y, but the closest match to Y is Z. Therefore X and Y do not match symmetrically. However, Y and Z do form a symmetric match, because the closest point to Y is Z and the closest point to Z is Y. This analogy does not perfectly apply to our scenario as there aren't two separate groups, but the general concept still applies and can be used in other scenarios even outside of computer vision.

Symmetric matching helps filter outliers and only find discriminative matches, just like the Lowe's ratio test (or Lowe's distance test). When used together, they can be very powerful to filter outliers without even yet considering any geometric information.
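
To make the idea concrete, here is a brute-force sketch of symmetric matching over binary descriptors, ignoring the Lowe's test for brevity. The tutorial code's actual symmetric_matching procedure is more involved, so treat this only as an illustration.

fn main() {
    // Index of the nearest descriptor in `targets` for each descriptor in `sources`,
    // using the Hamming distance on [u64; 8] descriptors.
    fn nearest(sources: &[[u64; 8]], targets: &[[u64; 8]]) -> Vec<usize> {
        sources
            .iter()
            .map(|s| {
                targets
                    .iter()
                    .enumerate()
                    .min_by_key(|(_, t)| {
                        s.iter().zip(t.iter()).map(|(a, b)| (a ^ b).count_ones()).sum::<u32>()
                    })
                    .map(|(i, _)| i)
                    .unwrap()
            })
            .collect()
    }

    // Two tiny toy groups whose descriptors obviously pair up.
    let old_group = vec![[0u64; 8], [u64::MAX; 8]];
    let new_group = vec![[u64::MAX; 8], [0u64; 8]];

    let new_to_old = nearest(&new_group, &old_group);
    let old_to_new = nearest(&old_group, &new_group);

    // Keep only pairs that choose each other in both directions.
    let symmetric: Vec<(usize, usize)> = new_to_old
        .iter()
        .enumerate()
        .filter(|&(new_ix, &old_ix)| old_to_new[old_ix] == new_ix)
        .map(|(new_ix, &old_ix)| (new_ix, old_ix))
        .collect();

    assert_eq!(symmetric, vec![(0, 1), (1, 0)]);
}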

Running the program

Make sure you are in the Rust CV mono-repo and run:

cargo run --release --bin chapter4-feature-matching

If all went well, you should see a window showing this:

Matches

The code

Open two images of the same area


#![allow(unused)]
fn main() {
    let src_image_a = image::open("res/0000000000.png").expect("failed to open image file");
    let src_image_b = image::open("res/0000000014.png").expect("failed to open image file");
}

We already opened an image in the other chapters, but this time we are opening two images. Both images are from the same area in the Kitti dataset.

Extract features for both images and match them


#![allow(unused)]
fn main() {
    let akaze = Akaze::default();

    let (key_points_a, descriptors_a) = akaze.extract(&src_image_a);
    let (key_points_b, descriptors_b) = akaze.extract(&src_image_b);
    let matches = symmetric_matching(&descriptors_a, &descriptors_b);
}

We created an Akaze instance in the last chapter, and we extracted features as well, but this time two things are different. For one, we are keeping the descriptors rather than throwing them away. Secondly, we are extracting features from both images.

The next part is the most critical. This is where we match the features. We perform symmetric matching just as is explained above. The tutorial code has the full code for the symmetric matching procedure, and you can read through it if you are interested, but for now we will just assume it works as described. It also performs the "Lowe's distance test" mentioned above with the binary features. It is using a distance of 48 bits, which is quite large, but this results in highly accurate output data. Feel free to mess around with this number and see what kinds of results you get with different values.

Make a canvas


#![allow(unused)]
fn main() {
    let canvas_width = src_image_a.dimensions().0 + src_image_b.dimensions().0;
    let canvas_height = std::cmp::max(src_image_a.dimensions().1, src_image_b.dimensions().1);
    let rgba_image_a = src_image_a.to_rgba8();
    let rgba_image_b = src_image_b.to_rgba8();
    let mut canvas = RgbaImage::from_pixel(canvas_width, canvas_height, Rgba([0, 0, 0, 255]));
}

Here we are making a canvas. The first thing we do is create a canvas of all black pixels. The canvas must be able to fit both images side-by-side, so that is taken into account when computing the width and height.

Render the images onto the canvas


#![allow(unused)]
fn main() {
    let mut render_image_onto_canvas_x_offset = |image: &RgbaImage, x_offset: u32| {
        let (width, height) = image.dimensions();
        for (x, y) in (0..width).cartesian_product(0..height) {
            canvas.put_pixel(x + x_offset, y, *image.get_pixel(x, y));
        }
    };
    render_image_onto_canvas_x_offset(&rgba_image_a, 0);
    render_image_onto_canvas_x_offset(&rgba_image_b, rgba_image_a.dimensions().0);
}

The first thing we do here is create a closure that will render an image onto the canvas at an X offset. This just goes through all X and Y combinations in the image using the cartesian_product adapter from the itertools crate. It puts the pixel from the image to the canvas, adding the X offset.

We then render image A and B onto the canvas. Image B is given an X offset of the width of image A. This is done so that both images are side-by-side.

Something that isn't shown here, but might be informative to some, is that because the last use of this closure is on this line, Rust ends the closure's borrow of the canvas here, allowing us to mutate the canvas afterwards. If we were to call this closure again on a later line, Rust would complain about the borrowing rules. Rust does this to make coding in it more pleasant, but it can be confusing for beginners who may not yet have an intuition for how the borrow checker works.

Drawing lines for each match

This part is broken up to make it more manageable.


#![allow(unused)]
fn main() {
    for (ix, &[kpa, kpb]) in matches.iter().enumerate() {
        let ix = ix as f64;
        let hsv = Hsv::new(RgbHue::from_radians(ix * 0.1), 1.0, 1.0);
        let rgb = Srgb::from_color(hsv);
}

At this point, we are iterating through each match. We also enumerate the matches so that we have an index. The index is converted to a floating point number. The reason we do this is so that we can assign a color to each match based on its index. This is done by rotating the hue of an HSV color by a fixed amount for each index. The HSV color has maximum saturation and value, so we basically get a bright and vivid color in all cases, and the color is varied using the radians of the hue. From this we produce an sRGB color, which is the color space typically used in image files unless noted otherwise.


#![allow(unused)]
fn main() {
        // Draw the line between the keypoints in the two images.
        let point_to_i32_tup =
            |point: (f32, f32), off: u32| (point.0 as i32 + off as i32, point.1 as i32);
        drawing::draw_antialiased_line_segment_mut(
            &mut canvas,
            point_to_i32_tup(key_points_a[kpa].point, 0),
            point_to_i32_tup(key_points_b[kpb].point, rgba_image_a.dimensions().0),
            Rgba([
                (rgb.red * 255.0) as u8,
                (rgb.green * 255.0) as u8,
                (rgb.blue * 255.0) as u8,
                255,
            ]),
            pixelops::interpolate,
        );
    }
}

This code looks longer than what it is actually doing. First, we create a closure that converts a keypoint's point, which is just a tuple of floats in sub-pixel position, into integers. It also adds a provided offset to the X value. This will be used to offset the points on image B when drawing the match lines.

The next step is that we call the draw function. Note that for the first point we use the key point location from image A in the match and for the second point we use the key point location from image B in the match. Of course, we add an offset to image B since it is off to the right of image A. The rgb value is in floating point format, so we have to convert it to u8. This is done by multiplying by 255 and casting (which rounds down). The alpha value is set to 255 to make it totally solid. Unlike the previous chapters, this function doesn't support blending/canvas operation, so we just set the alpha to 255 to avoid it acting strange.

Display the resulting image

The rest of the code has been discussed in previous chapters. Go back and see them if you are curious how it works. All it does is save the matches image out and open it with your system's configured image viewer.

End

This is the end of this chapter.