Jake's CS Notes - Computer Vision

//****************************************************************************//
//**************** Pose Estimation - September 30th, 2019 *******************//
//**************************************************************************//

- Ah, lack of sleep: turning everything I hear into an odd mixture of a lullaby and someone reciting "Jabberwocky"

- Alright, we're on September 30th, and first, some logistics:
    - Project 3 will be out tomorrow, and'll be about pose estimation, finding correspondences between images, and -
        - *at this point Professor Dellaert's microphone died*
            - "...I'll just project and totally kill my voice"
    - These projects are obviously pretty complex; you have the code, the python notebook, the textbook, the slides, and you have to synthesize all of them - but the slides should be helpful in guiding you through what to do
--------------------------------------------------------------------------------

- So, last week we were talking about image alignment and the RANSAC algorithm, where we randomly select some points, fit a model to them (e.g. a straight-line regression), check the number of inliers, and repeat until we get a good enough model or have run for too many trials
    - If we don't know the percentage of outliers in advance, then we'll start off by assuming it's 50%
        - We'll then run the algorithm for some number of trials "N," and update this based on the results we find
    - We also talked about 2D image alignment, where we have a set of matching points and try to fit a parametric model representing the image transformation
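
- A minimal sketch of that RANSAC recap, fitting a line to toy data. The function names, the 0.1 inlier threshold, the 99% confidence target, and the data are all my own illustrative assumptions (not from the lecture); the "update N" step uses the standard formula N = log(1 - p) / log(1 - (1 - e)^s) for outlier ratio e and sample size s:

```python
import numpy as np

def trials_needed(outlier_ratio, sample_size=2, confidence=0.99):
    """Trials needed so that, with probability `confidence`, one sample is all inliers."""
    p_clean = (1.0 - outlier_ratio) ** sample_size
    if p_clean >= 1.0:
        return 1
    return int(np.ceil(np.log(1.0 - confidence) / np.log(1.0 - p_clean)))

def ransac_line(points, threshold=0.1, max_trials=1000, seed=0):
    """Fit a*x + b*y + c = 0 to 2D points; returns ((a, b, c), inlier_mask)."""
    rng = np.random.default_rng(seed)
    n_trials = trials_needed(0.5)  # unknown outlier ratio: start by assuming 50%
    best_line, best_mask = None, np.zeros(len(points), dtype=bool)
    trial = 0
    while trial < min(n_trials, max_trials):
        trial += 1
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        # The line through p and q has a normal perpendicular to (q - p).
        normal = np.array([q[1] - p[1], p[0] - q[0]])
        length = np.linalg.norm(normal)
        if length == 0.0:
            continue  # degenerate sample (coincident points)
        normal = normal / length
        c = -normal @ p
        mask = np.abs(points @ normal + c) < threshold
        if mask.sum() > best_mask.sum():
            best_line, best_mask = (normal[0], normal[1], c), mask
            # Re-estimate the outlier ratio from the best model so far and update N.
            n_trials = trials_needed(1.0 - mask.mean())
    return best_line, best_mask

# Toy data: 60 points exactly on y = 2x + 1, plus 40 outliers well below the line.
xs = np.linspace(0.0, 10.0, 60)
inliers = np.column_stack([xs, 2.0 * xs + 1.0])
data_rng = np.random.default_rng(1)
outliers = np.column_stack([data_rng.uniform(0, 10, 40), data_rng.uniform(-20, -5, 40)])
points = np.vstack([inliers, outliers])

(a, b, c), mask = ransac_line(points)
slope, intercept = -a / b, -c / b
print(slope, intercept, mask.sum())
```

    - The winning line here passes exactly through 2 sampled points; in practice you'd usually refit with least squares on all of the inliers afterwards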

- Today, we're going to move from this 2D alignment onto 3D alignment, where we map 3D points into a 2D image!
    - *cue video from Oculus Quest, which Professor Dellaert casually mentions his grad students worked on*
    - This device actually builds a 3D map of the environment in real-time
        - "In principle, this is what we'll be doing for project 3!"
            - ...*shudder of fear goes through the class*
            - "...I mean, it won't be THIS advanced, but the same ideas apply"

- So, what's pose estimation?
    - POSE ESTIMATION is where we take some 2D points in an image that we know the corresponding 3D points for, and try to find the camera matrix that maps those 3D points onto the 2D ones
        - The 3D to 2D projection, wayyyyy back from our projection lesson, looks like this:

                x = K * [R|t] * X = P*X

    - Geometrically, "t" is a vector from an origin we choose to the camera
        - R, though, is the 3x3 rotation matrix that converts from camera coordinates to absolute world coordinates, with each column being a single camera axis (X/Y/Z) expressed in world coordinates
            - This rotation matrix should be written "wRc," then: it takes the camera's coordinates to the world's, and its inverse "cRw" goes from the world's coordinates to the camera's
                - "Remember, an inverse is just a transpose for a rotation matrix"
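        - In homogeneous coordinates, we do this mapping from a 3D world point "X" to a 2D image point "x" by first subtracting the camera position "t," then rotating into the camera frame with R^T (the inverse of the camera-to-world rotation R):

                x = K * R^T * [I|-t] * X = P*X

            - Or, equivalently, if "R" and "t" instead denote the world-to-camera rotation and translation (i.e. R^T and -R^T * t in the notation above):

                x = K * [R|t] * X = P*X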

            - Essentially, though, 
        - How many parameters are on the right side? 11, right - P is a 3x4 matrix (12 entries), but scale doesn't matter. On the left side, K has 5 DoF (since scale doesn't matter), and the rotation and translation have 3 DoF each, so that's 5 + 3 + 3 = 11 as well
    - So, what do the columns of P mean?
        - The vector [0, 0, 0, 1] would get us the image of the world origin in homogeneous coordinates - so, the 4th column is the image of the origin
            - Similarly, the 1st column is the image of the point at infinity on the X-axis, the 2nd column is the image of the point at infinity on the Y-axis, etc.
            - So, the columns of P are basically the 3 vanishing points of the world's X/Y/Z axes + the image of the (arbitrarily chosen) world origin
                - What do the rows mean? That's an exercise I'll leave to you
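
- To sanity-check the projection equation and the "columns of P" claim, here's a small numeric sketch. The intrinsics, rotation angle, and camera position are made-up values, and I'm assuming R in the notes is the camera-to-world rotation wRc (so the projection uses its transpose):

```python
import numpy as np

# Hypothetical intrinsics: 800 px focal length, principal point (320, 240), no skew.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# wRc: columns are the camera axes in world coordinates (30 degree turn about Y).
theta = np.deg2rad(30.0)
wRc = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                [0.0, 1.0, 0.0],
                [-np.sin(theta), 0.0, np.cos(theta)]])
cam_pos = np.array([1.0, 0.5, -4.0])  # "t": camera position in world coordinates

# Form 1: x = K * wRc^T * [I | -t] * X
P_a = K @ wRc.T @ np.hstack([np.eye(3), -cam_pos.reshape(3, 1)])

# Form 2: x = K * [R | t'] * X, with world-to-camera R = wRc^T and t' = -wRc^T * t
cRw = wRc.T
t_w2c = -cRw @ cam_pos
P_b = K @ np.hstack([cRw, t_w2c.reshape(3, 1)])

def project(P, X_world):
    """Project a 3D world point to pixel coordinates."""
    x = P @ np.append(X_world, 1.0)
    return x[:2] / x[2]

# The 4th column of P, dehomogenized, is the image of the world origin.
origin_image = project(P_a, np.zeros(3))
col4 = P_a[:2, 3] / P_a[2, 3]

# The 1st column is the vanishing point of the world X-axis:
# a point very far along X projects (almost) onto it.
far_along_x = project(P_a, np.array([1e9, 0.0, 0.0]))
vanishing_x = P_a[:2, 0] / P_a[2, 0]

print(np.allclose(P_a, P_b), origin_image, col4, far_along_x, vanishing_x)
```

    - Both forms give the same P, which is the point: they only differ in which way "R" and "t" are defined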

- So, if we want to do pose estimation, can't we just slap 4 points (the 3 vanishing points and origin) into a 3x4 matrix and call it a day?
    - Unfortunately, NO, this doesn't work; each point only gives us 2 degrees of freedom (since it's a 2D point in the image), so 4 points only gets us 8 DoF - short of our goal of 11
        - So, we need at LEAST 6 points (since this gets us 12 pieces of information, which is >= 11)
        - Once we have 6 points that we know the 3D/2D coordinates for, we take the 3D points, project them into the image with our current guess for "P," find the squared error between those projections and the actual 2D points, and minimize until we get close to the correct "P" matrix
            - You can use nonlinear least squares to do this, which we will NOT ask you to implement since SciPy can do it for you
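
- A sketch of the "at least 6 points" idea. Note this is NOT the nonlinear least squares from lecture: it's the linear DLT (direct linear transform) estimate, which I'm swapping in because it only needs NumPy; it's commonly used as a starting guess that a nonlinear solver would then refine. The camera values and the cube of 3D points are made up for the demo:

```python
import numpy as np

def estimate_projection_dlt(points_3d, points_2d):
    """Estimate the 3x4 matrix P (up to scale) from >= 6 3D/2D correspondences."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        Xh = [X, Y, Z, 1.0]
        # Each point gives 2 equations: p1.X - u * p3.X = 0 and p2.X - v * p3.X = 0
        rows.append([*Xh, 0.0, 0.0, 0.0, 0.0, *[-u * c for c in Xh]])
        rows.append([0.0, 0.0, 0.0, 0.0, *Xh, *[-v * c for c in Xh]])
    A = np.array(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)

def reprojection_rmse(P, points_3d, points_2d):
    """Root-mean-square pixel distance between projected 3D points and observed 2D points."""
    Xh = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    proj = (P @ Xh.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.sqrt(np.mean(np.sum((proj - points_2d) ** 2, axis=1)))

# Made-up ground-truth camera (world-to-camera R and t), used only to generate test data.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
angle = np.deg2rad(20.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([-0.3, -0.4, 5.0])
P_true = K @ np.hstack([R, t.reshape(3, 1)])

# 8 known 3D points (unit cube corners) and their noise-free 2D projections.
points_3d = np.array([[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)])
homog = (P_true @ np.hstack([points_3d, np.ones((8, 1))]).T).T
points_2d = homog[:, :2] / homog[:, 2:3]

P_est = estimate_projection_dlt(points_3d, points_2d)
P_est = P_est * (P_true[2, 3] / P_est[2, 3])  # fix the arbitrary scale for comparison
rmse = reprojection_rmse(P_est, points_3d, points_2d)
print(rmse)
```

    - With noisy points the DLT answer is only approximate, which is where minimizing the reprojection error with something like scipy.optimize.least_squares comes in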

- Alright, that's pose estimation - let's now start looking at structure from motion!
    - Dealing with multiple views of the object seems like a pain, but if we know where the cameras are then we can actually triangulate points and figure out EXACTLY where they are in the world!
        - Stereo vision kind of reduces to single-image (monocular) depth cues past ~10 meters ("beyond that, you have to use other stuff for depth cues"), but is effective within that range
    - In a stereo camera rig, all we have to do is search horizontally to see how far the same point is offset between the two images
        - "In Project 2, to get 2 different perspectives, you CANNOT just rotate the camera - the world doesn't get 'more 3D' by rolling your eyes. You need to actually pick up the camera and MOVE."
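
- A sketch of triangulation with a toy stereo rig. The intrinsics, the 0.1 m baseline, and the test point are assumptions for illustration, and the linear (SVD-based) triangulation is just one common way of intersecting the two rays, not necessarily the one we'll use in the project:

```python
import numpy as np

def triangulate(P_left, P_right, x_left, x_right):
    """Linear triangulation of one 3D point from its pixel position in two views."""
    A = np.array([
        x_left[0] * P_left[2] - P_left[0],
        x_left[1] * P_left[2] - P_left[1],
        x_right[0] * P_right[2] - P_right[0],
        x_right[1] * P_right[2] - P_right[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X_world):
    x = P @ np.append(X_world, 1.0)
    return x[:2] / x[2]

focal = 800.0
baseline = 0.1  # the right camera is MOVED 0.1 units along X, not rotated
K = np.array([[focal, 0.0, 320.0], [0.0, focal, 240.0], [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-baseline], [0.0], [0.0]])])

X_true = np.array([0.3, -0.2, 4.0])
x_left = project(P_left, X_true)
x_right = project(P_right, X_true)

# Same image row in both views, so the match only has to be searched for horizontally.
disparity = x_left[0] - x_right[0]
# For this kind of rectified rig, depth = focal * baseline / disparity.
depth_from_disparity = focal * baseline / disparity

X_est = triangulate(P_left, P_right, x_left, x_right)
print(x_left, x_right, disparity, depth_from_disparity, X_est)
```

    - Since disparity shrinks like 1 / depth, far-away points barely shift between the two images, which fits with stereo losing its usefulness at long range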

- In particular, we'll start by talking about the FUNDAMENTAL MATRIX
    - It's hard to find 3D correspondences between different images, but the good news is that we can check if something is a good match with the fundamental matrix theorem!
    - Suppose we have 2 different 2D views of the same point, P - where can the point be in the world?
        - Well, the thing we don't know is at what depth that point appears, so it could be anywhere along a ray shooting out from the camera through that pixel - and if we have 2 images, and we know where the point is in both images, then we know where it is in 3D (it's where the 2 rays meet)!
            - What if we only know the point in one image and the actual, 3D location of the point - could we figure out where it appears in the 2nd image? Yes, we could!
                - And if we DON'T know the exact 3D position of the point, we still know that the point has to appear along a certain line in the 2nd image (at a different point for each possible depth the point is at) - and we can calculate that line!
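
- A brute-force sketch of that "line of possible appearances": take a pixel in image 1, try a handful of depths along its ray, project each candidate 3D point into image 2, and check that the results are collinear. Both cameras and the pixel are made-up values, and this deliberately avoids the fundamental matrix (that's next lecture):

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])

# Camera 1 sits at the world origin; camera 2 is rotated AND translated (world-to-camera R2, t2).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
angle = np.deg2rad(15.0)
R2 = np.array([[np.cos(angle), 0.0, np.sin(angle)],
               [0.0, 1.0, 0.0],
               [-np.sin(angle), 0.0, np.cos(angle)]])
t2 = np.array([-1.0, 0.0, 0.2])
P2 = K @ np.hstack([R2, t2.reshape(3, 1)])

# A pixel observed in image 1, and the direction of the ray it came from (scaled so Z = 1).
x1 = np.array([400.0, 260.0])
ray = np.linalg.inv(K) @ np.array([x1[0], x1[1], 1.0])

# Each possible depth gives a different 3D point, and so a different pixel in image 2.
candidates = []
for depth in (2.0, 4.0, 8.0, 16.0):
    X = depth * ray
    x2 = P2 @ np.append(X, 1.0)
    candidates.append(x2[:2] / x2[2])
candidates = np.array(candidates)

def to_homogeneous(p):
    return np.array([p[0], p[1], 1.0])

# The line through the first two candidates (cross product of homogeneous points)...
line = np.cross(to_homogeneous(candidates[0]), to_homogeneous(candidates[1]))
line = line / np.linalg.norm(line[:2])
# ...should pass through all the other candidates too (distances in pixels).
distances = np.abs([line @ to_homogeneous(p) for p in candidates])
print(candidates, distances)
```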

- How the heck do we compute that line of "possible appearances", though? Well, that gets stupidly mathematical with 2D geometry - and we'll do that next lecture!