//****************************************************************************//
//******** Feature Detection / Harris Detector - September 16th, 2019 *******//
//**************************************************************************//

- Don't forget: the 1st part of homework 2 is due THIS WEDNESDAY!
--------------------------------------------------------------------------------

- Okay - today is the day we start talking about computer vision PROPERLY!
    - Up until now, we've been talking about some image processing fundamentals, mixed with a crash-course through deep learning
    - "There's a decent amount of math here; it's not a ridiculous amount of it, but there will be some calculus, which as CS majors you probably haven't seen since your freshman year"

- As it turns out, one BIG issue in computer vision is CORRESPONDENCE: detecting features in one image and matching them to the same features in a different image
    - Why is this important? Well, it lets us determine if a new image is a totally different image, or if it's just a shifted view of the same statue/lamppost/elderly relatives/etc.
    - More than this, we might be able to find the FUNDAMENTAL MATRIX relating the two images, letting us calculate how the view has changed between them and opening the door to a bunch of cool stuff like 3D scanning, motion tracking, position mapping, robot navigation, and a BUNCH more
    - "This also used to be the key step in object recognition, but deep learning has made this less and less necessary"

- So, let's start getting into how we can detect important features/keypoints/etc. in an object (they're all the same thing), and then possibly get to matching these across images tomorrow
    - First up: how can we find points on an image that might be useful to recognize it from different positions later on?
        - One common way to do this is to go for sharp CORNERS in an image
    - Once we've found these keypoints, we can define a small region around each point, try and normalize for scale/rotation across the images, and then match
    - "Pretty soon we're going to have a midterm, and I'm going to ask you questions like 'how many features do we need to align 2 images?'"
        - ...in this case, for the cat image on the slide, we're looking at an affine transformation, which has 6 degrees of freedom, and each point correspondence gives us 2 constraints, so we'd need 3 points (there's a quick worked sketch of this just after the goals list below)
        - "I'd recommend you write this down"
    - The goal is to find points that are both repeatable and distinctive, so we can find the same features across multiple different views of the same scene
        - REPEATABLE features are ones that we can find despite geometric/photometric transformations of the image
        - SALIENCY is how distinct the feature is from the rest of the image
        - EFFICIENCY means that we have significantly fewer features than image pixels (so we're not using an obscene amount of memory)
        - LOCALITY means that each feature only takes up a small part of the image, which makes it less likely we'll lose it to occlusion or clutter
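- To make that midterm example concrete, here's a minimal sketch (mine, not from the lecture) of recovering an affine transform from exactly 3 point correspondences; the fit_affine helper and the sample points are made up purely for illustration:

    # An affine map has 6 unknowns (a, b, c, d, e, f):
    #     x' = a*x + b*y + c
    #     y' = d*x + e*y + f
    # Each correspondence (x, y) -> (x', y') gives 2 equations, so 3
    # (non-collinear) points give exactly the 6 equations we need.
    import numpy as np

    def fit_affine(src, dst):
        """Solve for the 2x3 affine matrix mapping 3 src points onto 3 dst points."""
        A, b = [], []
        for (x, y), (xp, yp) in zip(src, dst):
            A.append([x, y, 1, 0, 0, 0]); b.append(xp)   # equation for x'
            A.append([0, 0, 0, x, y, 1]); b.append(yp)   # equation for y'
        params = np.linalg.solve(np.array(A, float), np.array(b, float))
        return params.reshape(2, 3)

    # e.g. a pure translation by (5, -2), recovered from 3 correspondences:
    src = [(0, 0), (1, 0), (0, 1)]
    dst = [(5, -2), (6, -2), (5, -1)]
    print(fit_affine(src, dst))   # -> [[1. 0. 5.], [0. 1. -2.]]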
- With those being the goals for our feature-detector, how can we actually do this?
    - There're a TON of different algorithms for this, but we'll talk about the Harris algorithm, since you need to use it for your homework
    - This is an older algorithm (it's originally from 1988), but it's still used VERY frequently because it's relatively cheap to compute and "good enough" for many tasks

- The main idea here is that, if we look through a tiny window at a piece of our image, we should be able to tell where we are
    - If we're in a flat region with the same color/pattern, that's no good; everything looks the same!
    - If we're on an edge, we can at least tell we're at a boundary, but we can't tell WHERE on the edge we are...
    - But if we're at a corner, where the color changes whichever direction we move, then we're good!

- Mathematically, how do we tell how "cornery" something is? By trying to tell how different the pixels will be if we move around a little!
    - More specifically, we say the difference score for a WINDOW of pixels at (x,y), shifted by (u,v), is:

        E(u,v) = sum over window's pixels{ w(x,y) * (I(x+u, y+v) - I(x,y))^2 }

    - Where:
        - I(x,y) is the intensity of the original pixel at (x,y) in the window
        - I(x+u, y+v) is the intensity of that pixel shifted by (u,v)
        - w(x,y) is the "window function" telling us how to weight each pixel at (x,y) in the window (a Gaussian to weight the center pixels more heavily, a uniform mean to weight each pixel equally, etc.)
    - So, this is essentially giving us the sum of squared differences in brightness over all the pixels around our original pixel ("the chosen pixel"), between two different windows: our "original" window and the "shifted" window
    - So, for E(0,0), the difference score will be 0, since we're subtracting our original window's brightness values from themselves!
    - If we shift the window by (u,v), then, the score will get higher the more different the shifted window is from our original window!

- How does this match up with our corner classifications? Suppose we compute these difference scores over a small range of window shifts (say 8x8), with "black" being a score of 0 and white being a very high difference score
    - If we're on a flat, uniform-colored area of the image, this 8x8 scoring image would be totally black - there are no differences anywhere, so we can't tell where we are!
    - On an edge, if we choose a pixel right in the middle of a smooth edge, we'd have 0 difference along the edge but increasing difference as we move to the side; we'd end up with a U-shaped tube
        - Or, in 2D terms, there'd be a black "line" of no difference along the edge in our 8x8 image, but lighter colors to the sides, since those areas are different
    - On a corner, though, we'd still have 0 difference in the middle, but around it the difference would quickly rise, like a bowl!
    - So, this equation successfully identifies corners as being the "best"!

- So, that equation looks like a good bet - but it's HORRIBLY inefficient
    - Not only do we have to iterate over every pixel in the image, but we have to compute these windows for EACH ONE, and for each pixel we need to compute multiple windows for our "shift range"
    - If we have a 600x600 image, and we're using an 11x11 window size and an 11x11 "shift range" to get this grid of scores, we'd end up with:

        O(image_width^2 * window_size^2 * shift_size^2) = 600^2 * 11^2 * 11^2

    - ...in other words, we need to do 11^4 = 14,641 pixel-difference calculations for each pixel in the image - not great! (there's a tiny brute-force sketch of this score right below)
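- Here's a minimal brute-force version of that E(u,v) score (my own sketch, not the homework's code), using a uniform window function w(x,y) = 1; the function name, the toy image, and the window/shift sizes are made up just to show how flat/edge/corner windows score differently:

    import numpy as np

    def ssd_score(image, x, y, u, v, half_win=5):
        """E(u,v) for the (2*half_win+1)-wide window centered at (x, y), shifted by (u, v)."""
        win = image[y - half_win : y + half_win + 1, x - half_win : x + half_win + 1]
        shifted = image[y + v - half_win : y + v + half_win + 1,
                        x + u - half_win : x + u + half_win + 1]
        return np.sum((shifted - win) ** 2)

    # Tiny test image: black everywhere except a bright block in the top-right,
    # giving us a flat region, a vertical edge, and a corner to probe
    img = np.zeros((60, 60))
    img[:30, 30:] = 1.0

    for name, (x, y) in [("flat", (10, 10)), ("edge", (30, 10)), ("corner", (30, 30))]:
        scores = [ssd_score(img, x, y, u, v) for u in range(-3, 4) for v in range(-3, 4)]
        print(name, "max E over small shifts:", max(scores))
    # flat   -> 0 for every shift (we can't tell where we are)
    # edge   -> 0 for shifts along the edge, large for shifts across it
    # corner -> large for (almost) every shift direction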
- So, how can we speed this calculation up? Remember Taylor expansions from calculus? We can approximate a function by summing together terms built from its derivatives, right?
    - "In Engineering, EVERYTHING locally looks like a line if you're close enough, and everything looks like a quadratic if you're far enough away"
    - Well, in 2D there's a similar Taylor approximation we can do with linear algebra, and if we take the 2nd-order Taylor expansion as "good enough", then we can approximate our equation as:

        E(u,v) ~= E(0,0) + [u v]|E_u(0,0)| + 0.5*[u v]|E_uu(0,0) E_uv(0,0)||u|
                                |E_v(0,0)|            |E_uv(0,0) E_vv(0,0)||v|

    - This lets us avoid having to sum over the entire window for every single shift! We can approximate the whole score surface from a few derivatives!
    - Even better, E(0,0) and the 1st-derivative term ALWAYS end up being 0 (E(0,0) = 0 and E can't go negative, so (0,0) is a minimum), so we can ignore them!

- How do we calculate the 2nd derivative of an image, though?
    - That's actually a little bit cumbersome, but we can approximate it using the SECOND MOMENT MATRIX, where we take the gradient (i.e. the difference in the pixels) in the X/Y directions, and then put them into a matrix like this:

        M = sum over window{ w(x,y) * |I_x^2   I_x*I_y| }
                                      |I_x*I_y I_y^2  |

        E(u,v) ~= [u v] * M * |u|
                              |v|

    - This means we only have to compute the gradients I_x and I_y ONCE for the whole image, and we can then reuse them for every pixel's window, which is MUCH faster than recomputing everything per pixel
    - This still involves summing over the window for each pixel (the gradients can be reused, but M has to be recalculated per pixel), but that's SUPER fast in modern libraries

- If we look at this approximated E(u,v) function, it looks like a 3D parabola, and its cross-sections actually look like ellipses for each pixel, with the axis lengths given by the eigenvalues of that pixel's "M" matrix (specifically, they're proportional to lambda^-0.5)
    - When neither eigenvalue is large we're in a flat region, and when only 1 of them is large we're on an edge (since the width of the parabola will be HUGE in at least one direction, i.e. it's growing slowly); but when BOTH of them are large, we're at a corner!
    - We can calculate the "corner score" of each pixel, then, by using the determinant/trace of the matrix:

        R = lambda1*lambda2 - a*(lambda1 + lambda2)^2 = determinant(M) - a*trace(M)^2

    - ...where "a" is some constant weight we choose (usually ~0.05)

- PHEW. After all that math, we're nearly done; to find the corners in our image, we just need to do the following:
    1) Compute the M matrix/corner score "R" for each pixel
    2) Ignore any pixels with a score below a certain threshold
    3) In each region of high pixel score in the image, just take the local maxima (to cut down on how many points we have)
    - The end result of all of this should be a list of points in the image, matching up (hopefully) with some good points to recognize (there's a rough end-to-end sketch of these steps tacked onto the end of these notes)

- Okay, that's how to recognize features - we'll talk about how to match them up across different images on Wednesday!
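- Postscript (my own sketch, NOT the homework's reference implementation): here's roughly what those three steps look like with NumPy/SciPy - the function name and all the parameter values (sigma, a = 0.05, the thresholds, the neighborhood size) are just illustrative guesses:

    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter, sobel

    def harris_corners(image, sigma=1.0, a=0.05, rel_threshold=0.01, nms_size=7):
        img = image.astype(float)

        # Image gradients I_x and I_y (Sobel filters as the finite-difference approximation)
        Ix = sobel(img, axis=1)
        Iy = sobel(img, axis=0)

        # Entries of the second moment matrix M, summed over a Gaussian window w(x,y)
        Sxx = gaussian_filter(Ix * Ix, sigma)
        Syy = gaussian_filter(Iy * Iy, sigma)
        Sxy = gaussian_filter(Ix * Iy, sigma)

        # Step 1: corner score R = det(M) - a*trace(M)^2, computed for every pixel at once
        R = (Sxx * Syy - Sxy * Sxy) - a * (Sxx + Syy) ** 2

        # Step 2: throw away pixels whose score is below a threshold
        strong = R > rel_threshold * R.max()

        # Step 3: only keep local maxima within an nms_size x nms_size neighborhood
        local_max = R == maximum_filter(R, size=nms_size)

        ys, xs = np.nonzero(strong & local_max)
        return list(zip(xs, ys))          # (x, y) coordinates of the detected corners

    # Usage on a toy image (one bright block): the block's inner corner should show up
    img = np.zeros((60, 60))
    img[:30, 30:] = 1.0
    print(harris_corners(img))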