//****************************************************************************//
//********* Machine Learning Basics (cont.) - September 10th, 2019 **********//
//**************************************************************************//

- Right; I think things went well, but I can't count my chickens before they hatch.

- So, today we're hoping to finish doing our lightning introduction to ML, and then we'll get started on what you need to know for project 3 and assessing learners
    - We'll talk about types of learning, and then about "ensemble learners"
    - On Thursday, we'll then do a deep-dive tutorial on everything you need to know for the project
- Next week on Tuesday, Professor Byrd will be away visiting J.P. Morgan, so there'll be a guest lecturer from a GT postdoc (who's actually taught ML here at Georgia Tech before)
    - Professor Isbell tends to teach ML via narrative, while Brian (the guest lecturer) takes a more formal, mathematical approach

- Project 1 should finish being graded tonight; "if they don't, I'll increasingly get more agitated at them"
    - There were sadly several cases of cheating caught; luckily, we have a TA whose sole job now is to deal with that
    - If you don't like the grade you got, you have a week to make a regrade request

- Also: the class after us has an exam today, so we need to get out of dodge FAST after this!
--------------------------------------------------------------------------------

So, last time we were talking about generative vs discriminative learners
    - Basically, the discriminative learners are just chopping the input space into regions, and saying "this region means 'Cat image', that region means 'Dog', etc."
        - This means we can perform better on some complex problems, but we can't sample from it, since we don't know how "dense" each region is/how likely a given input is
    - Generative learners, on the other hand, try to learn the entire distribution
        - This is great since we can sample from it, and generate reasonable models with very small amounts of data, but it takes SIGNIFICANTLY longer to train for complex problems

- Another distinction between ML systems is the type of learning they do
    - BATCH learners need to train on all the data at once; if we add new data, we need to start all over again
        - They are *usually* more accurate, but you need to retrain all in one go
        - Polynomial regression is an example of this
    - ONLINE learners, though, can train on successive samples as they arrive without starting over
        - This is necessary for a LOT of applications
        - KNN is an example of an online learner

- Alright, those are ways of classifying learning algorithms, but what IS learning?
    - In an ML context, LEARNING is just "function approximation from data"
        - This isn't *quite* true for unsupervised learning (where there's no target Y to approximate), but it's close for most other types
    - Now, we mentioned last time that parameters are the stuff our algorithm can tweak, while hyperparameters are things we need to provide that control the learner
        - "Hyperparameter search" is usually trying to search for the hyperparameters that get us the best model

- ...okay, let's talk about our first learning algorithm example: KNN!
    - K-Nearest Neighbors is an instance-based learner that stores all the data points it's given
        - "Running an example of this is a pretty likely exam question"
    - So, let's say we give the algorithm 5 data points:
        - A (2,1)
        - B (4,3)
        - C (5,2)
        - D (7,2)
        - E (9,3)
    - Given that, let's suppose "K"=1, and we want to estimate the Y-value at X=1
        - The closest point is "A", and the Y-value of "A" is 1, so we'd return 1!
        - If K=2, then we'd take the two closest points on the X-axis (A/B) and average their Y values ((1+3)/2 = 2), and return that
            - What if we need just 1 neighbor, and there are 2+ options that're an equal distance away? That's implementation specific!
                - If it's a regression problem, we'll usually take the mean between the two options and use that
                - If it's a classification problem, we can't exactly take the average of "dog" or "cat," so we'll usually just pick the first point, or a random choice, etc.
        - For classification problems, we'll also take the MODE instead of the mean, since we can't take the average of different categories
    - Because the prediction is just an average over the closest neighbors, the output function looks like a step function, which jumps whenever the set of nearest neighbors changes (new closest point, new 2nd-closest point, etc.)
        - If K=2, then it'd be when the farthest current neighbor gets farther away than the next non-neighbor
        - If K=N, we're just going to get a horizontal line for everything, so K is really important
            - Too low, and the step function will be really noisy
            - Too high, and we'll smooth away too much detail (underfitting)
    - The KEY WEAKNESS with KNN, because of this, is that it can't extrapolate! If we give it "x = 10000000", it'll still use "E" as the closest neighbor!
        - So, it can't detect trends outside of the existing data, but it IS good at interpolating noisy data (there's a quick code sketch of the KNN query just below)
- Now, KNN is so simple it's too dumb to be used that often; KERNEL REGRESSION, on the other hand, is a version of KNN that weighs points differently using a "kernel" (a function that decides how different points/neighbors should be weighted)
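
- As a quick, hedged sketch of that worked KNN example (plain numpy, purely illustrative; "knn_predict" is just a name for this note, not any course API):

        import numpy as np

        # Training data from the example above: (x, y) pairs A..E
        train_x = np.array([2, 4, 5, 7, 9], dtype=float)
        train_y = np.array([1, 3, 2, 2, 3], dtype=float)

        def knn_predict(query_x, k):
            """Average the Y values of the k stored points closest to query_x."""
            dists = np.abs(train_x - query_x)               # 1-D distance to every stored point
            nearest = np.argsort(dists, kind="stable")[:k]  # k closest; ties resolved by index order
            return train_y[nearest].mean()                  # regression: mean; classification would use the mode

        print(knn_predict(1, k=1))         # A is closest, so we return 1.0
        print(knn_predict(1, k=2))         # A and B: (1 + 3) / 2 = 2.0
        print(knn_predict(10000000, k=1))  # still uses E, returns 3.0 (no extrapolation!)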

- Another common learning algorithm is LINEAR REGRESSION
    - Usually, when we talk about this, we're talking about multivariate regression, like so:

            y = m0 + m1 * x1 + m2 * x2 + m3 * x3 + ...

    - Here, the learning task is simple: we're trying to find a hyperplane (a flat, (N-1)-dimensional surface in our N-dimensional space) that best fits the data we have
        - By "best fit," we mean that it minimizes our loss function (i.e. how different our predictions are from reality)
    - So, the parameters we're trying to learn here are m0/m1/m2/etc. (there's a small sketch at the end of this section)
        - This is MUCH slower than training KNN (which doesn't even need training), since we need gradient descent (or a closed-form least-squares solve) to find the parameters
        - On the other hand, querying linear regression is much faster than KNN, since we don't need to calculate distances to all the points; we just need to plug in all the Xs!
            - So, KNN might be better if we're adding new data constantly but querying infrequently
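    - A minimal least-squares sketch of the regression above, using toy, made-up data (plain numpy, not any course template); the fit here is done in closed form rather than by gradient descent:

            import numpy as np

            # Toy data: two features (x1, x2) and a target y (numbers are made up)
            X = np.array([[1.0, 2.0],
                          [2.0, 1.0],
                          [3.0, 4.0],
                          [4.0, 3.0]])
            y = np.array([5.0, 4.0, 11.0, 10.0])

            # Add a column of ones so the intercept m0 gets learned too
            X1 = np.column_stack([np.ones(len(X)), X])

            # Closed-form least squares: the m's that minimize squared error
            coeffs, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)
            print(coeffs)                           # [m0, m1, m2]

            # Querying is just "plug in the Xs" (no distances to stored points needed)
            print(np.array([1.0, 2.5, 3.5]) @ coeffs)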

- Alright: what do you need to know for project 3?
    - First off, how do we get these (x,y) training tuples from our stock data?
        - Most learning algorithms have a poor conception of time/predicting the future/etc; there are time-sensitive learners where the order we send data matters, but they're not the norm
            - So, how do we do this in terms of "when I see X, predict Y"? Well, we'll need to do some sort of offsetting!
        - Let's suppose we have the following stock data for AAPL:

                    Price/SMA20 |   BBRatio |   Price
                           0.94 |      0.41 |     201
                           0.96 |      0.44 |     204
                           0.99 |      0.48 |     209
                           1.02 |      0.52 |     206

            - "SMA20" means the price divided by the average price from the past 20 days
            - "BBRatio" means "Bollinger Band Ratio," which we'll talk about more later (it basically means how far above the recent price we are, with 0.5 in the middle)
        - So, we want to predict tomorrow's price using today's data; what should we do?
            - Well, we'll offset the data so the current day's X features use TOMORROW's price as the Y! (sketched just below)
                - The issue with this is that we'll lose some of our data (how much? With a one-day offset, just the last row, since the final day has no "tomorrow" to use as its Y)
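        - As a rough pandas sketch of that offsetting (the column names mirror the table above; this is not the project's exact template):

                import pandas as pd

                # Indicator values copied from the table above
                df = pd.DataFrame({
                    "Price/SMA20": [0.94, 0.96, 0.99, 1.02],
                    "BBRatio":     [0.41, 0.44, 0.48, 0.52],
                    "Price":       [201, 204, 209, 206],
                })

                X = df[["Price/SMA20", "BBRatio"]]   # today's features
                Y = df["Price"].shift(-1)            # tomorrow's price, lined up with today's row

                # The final day has no "tomorrow", so its row gets dropped; that's the data we lose
                X, Y = X.iloc[:-1], Y.iloc[:-1]
                print(pd.concat([X, Y.rename("TomorrowPrice")], axis=1))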
    - Next, how do we know if we actually learned something?
        - In finance, we do this through BACKTESTING!
            - We first create our nice, shiny model
            - Then, we start at a date in the past (which we pretend is the current day)
            - We'll then give the model data from before it's "current" day...
                - ...and have it make predictions about the "current" day, and compare it to the actual data!
                - Be VERY careful here; make sure your bot doesn't accidentally see the future!
            - We'll then step forward one day, give it the next day's data, and repeat (this loop is roughly sketched at the end of this section)
        - This'll give us the trades our algorithm would've done at that time!
            - Once we have that list of trades, we'll have a "trading simulator" that can take in our start/end dates and list of trades, and give us statistics about our portfolio (returns, Sharpe ratio, etc.) to see how we did
                - Our 5th project in this class is basically that: given a list of trades, evaluate that strategy on a day-by-day basis
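        - As a rough sketch of that walk-forward loop (made-up prices and a naive stand-in "model"; the only point is that the model never sees anything past its pretend "today"):

                import pandas as pd

                # Toy price series standing in for real market data (values are made up)
                prices = pd.Series([201.0, 204.0, 209.0, 206.0, 210.0],
                                   index=pd.date_range("2019-01-02", periods=5))

                def predict_next(history):
                    """Stand-in model: naively predict that tomorrow's price equals today's."""
                    return history.iloc[-1]

                # Step forward one pretend "today" at a time, predict, then compare to reality
                for today, tomorrow in zip(prices.index[:-1], prices.index[1:]):
                    history = prices.loc[:today]          # only data up to "today" (no peeking!)
                    prediction = predict_next(history)
                    print(today.date(), "predicted:", prediction, "actual:", prices.loc[tomorrow])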

- So, backtesting tells us how well our trading strategy worked, but if our strategy wasn't good then we don't know WHY!
    - It could be that our ML algorithm stinks; it could also be that we have a very good learning algorithm that's been trained on a really bad dataset/training strategy
    - So, to assess our learner itself (NOT its financial strategy), we'll use a loss function
        - In this class, we'll use RMSE ("root mean square error"), which is basically what it sounds like: sum up the squared errors, divide by N, then take the square root (to keep the scale close to the original units); there's a tiny numpy check of this below:

                RMSE = sqrt( sum( (y_actual - y_predicted)^2 ) / N )

        - Again here, why do we use loss functions instead of just how many times we were right/wrong?
            - One big reason is because we can weight different errors appropriately; if we're making a cancer detector, false negatives are a LOT worse than false positives, and we want our agent to learn that!
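        - A tiny numpy check of that formula (toy numbers, purely illustrative):

                import numpy as np

                y_actual = np.array([201.0, 204.0, 209.0, 206.0])
                y_pred   = np.array([203.0, 205.0, 207.0, 208.0])

                # Square the errors, average over N, then take the square root
                rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
                print(rmse)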

- To wrap up today, there are 2 big types of error in machine learning (and especially in finance)
    - IN-SAMPLE error is the error we get when we're querying on our training data
        - This lets us check that our learner is actually learning properly; if it can't work on the data it already saw, then we're probably using the wrong type of learner
    - OUT-OF-SAMPLE error is when we're testing on new, unseen data
        - This is what we actually care about!
    - How we assess these two varies depending on the type of learner
        - If we're working with a batch/non-iterative learner, we'll just split our data into "training" and "testing" groups (often an 80%/20% split after shuffling the data; a small sketch of this is at the end of this section)
            - It turns out the wine data that's commonly used for teaching has its wine rankings in chronological order - and, sure enough, the more drunk people got, the higher their ratings got!
            - Shuffling is NOT a good idea if your data is time-dependent, so in that case you should split by time instead (e.g., train on the earlier dates and test on the later ones)
        - For iterative learners, we need to choose other pieces of data to hold out (?)
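        - A minimal numpy sketch of that shuffle-then-split (the helper name and the toy data are just illustrative):

                import numpy as np

                def train_test_split(X, Y, train_frac=0.8, shuffle=True, seed=0):
                    """Split arrays into train/test sets; only shuffle if row order doesn't matter!"""
                    idx = np.arange(len(X))
                    if shuffle:                              # skip this for time-dependent data
                        np.random.default_rng(seed).shuffle(idx)
                    cut = int(len(X) * train_frac)
                    train, test = idx[:cut], idx[cut:]
                    return X[train], Y[train], X[test], Y[test]

                # Example: 80%/20% split of some toy data
                X = np.arange(20).reshape(10, 2)
                Y = np.arange(10)
                X_train, Y_train, X_test, Y_test = train_test_split(X, Y)
                print(len(X_train), "training rows,", len(X_test), "testing rows")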

- Alright; we'll talk more about your Project 3 on Thursday!