//****************************************************************************//
//********* Machine Learning Basics (cont.) - September 10th, 2019 **********//
//**************************************************************************//
- Right; I think things went well, but I can't count my chickens before they hatch.
- So, today we're hoping to finish doing our lightning introduction to ML, and then we'll get started on what you need to know for project 3 and assessing learners
- We'll talk about types of learning, and then about "ensemble learners"
- On Thursday, we'll then do a deep-dive tutorial on everything you need
- Next week on Tuesday, Professor Byrd will be away visiting J.P. Morgan, so there'll be a guest lecture from a GT postdoc (who's actually taught ML here at Georgia Tech before)
- Professor Isbell tends to teach ML via narrative, while Brian (the guest lecturer) takes a more formal, mathematical approach
- Project 1 should finish being graded tonight; "if they don't, I'll increasingly get more agitated at them"
- There were sadly several cases of cheating caught; luckily, we have a TA whose sole job now is to deal with that
- If you don't like the grade you got, you have a week to make a regrade request
- Also: the class after us has an exam today, so we need to get out of dodge FAST after this!
--------------------------------------------------------------------------------
So, last time we were talking about generative vs discriminative learners
- Basically, the discriminative learners are just chopping the input space into regions, and saying "this region means 'Cat image', that region means 'Dog', etc."
- This means we can perform better on some complex problems, but we can't sample from the model, since we don't know how "dense" each region is/how likely a given input is
- Generative learners, on the other hand, try to learn the entire distribution
- This is great since we can sample from it, and generate reasonable models with very small amounts of data, but it takes SIGNIFICANTLY longer to train for complex problems
- Another distinction between ML systems is the type of learning they do
- BATCH learners need to train on all the data at once; if we add new data, we need to start all over again
- They are *usually* more accurate, but you need to retrain all in one go
- Polynomial regression is an example of this
- ONLINE learners, though, can train on successive samples as they arrive without starting over
- This is necessary for a LOT of applications
- KNN is an example of an online learner
- Alright, those are ways of classifying learning algorithms, but what IS learning?
- In an ML context, LEARNING is just "function approximation from data"
- This isn't *quite* true for unsupervised learning, but it's close for most other types
- For unsupervised learning, function
- Now, we mentioned last time that parameters are the stuff our algorithm can tweak, while hyperparameters are things we need to provide that control the learner
- "Hyperparameter search" is usually trying to search for the hyperparameters that get us the best model
- ...okay, let's talk about our first learning algorithm example: KNN!
- K-Nearest Neighbors is an instance-based learner that stores all the data points it's given
- "Running an example of this is a pretty likely exam question"
- So, let's say we give the algorithm 5 data points:
- A (2,1)
- B (4,3)
- C (5,2)
- D (7,2)
- E (9,3)
- Given that, let's suppose "K"=1, and we want to estimate the Y-value at X=1
- The closest point is "A", and the Y-value of "A" is 1, so we'd return 1!
- If K=2, then we'd take the two closest points on the X-axis (A/B) and average their Y values ((1+3)/2 = 2), and return that
- What if we need just 1 neighbor, and there are 2+ options that're an equal distance away? That's implementation specific!
- If it's a regression problem, we'll usually take the mean between the two options and use that
- If it's a classification problem, we can't exactly take the average of "dog" or "cat," so we'll usually just pick the first point, or a random choice, etc.
- For classification problems, we'll also take the MODE instead of the mean, since we can't take the average of different categories
- Because the output is an average over the closest neighbors, the output function for this looks like a step function, which jumps whenever the set of nearest neighbors changes (a new closest point/2nd closest point/etc.)
- If K=2, then it'd be when the farthest current neighbor gets farther away than the next non-neighbor
- If K=N, we're just going to get a horizontal line for everything, so K is really important
- Too low, and the step function will be really noisy
- Too high, and we'll average away too much of the detail in the data
- The KEY WEAKNESS with KNN, because of this, is that it can't extrapolate! If we give "x = 10000000", it'll still use "E" as the closest neighbor!
- So, it can't detect trends outside of the existing data; but it is good at interpolating noisy data
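- As a rough sketch of the worked example above (not code from the lecture): a tiny 1-D KNN regressor over points A through E. The function name `knn_predict` and the tie-breaking rule (stable sort, so the earlier point wins) are assumptions made for illustration.

```python
def knn_predict(x_query, points, k):
    """Predict y at x_query as the mean y of the k nearest points (by x-distance)."""
    # sorted() is stable, so points at an equal distance keep their input order.
    # That tie-breaking rule is an assumption; real implementations differ.
    nearest = sorted(points, key=lambda p: abs(p[0] - x_query))[:k]
    return sum(y for _, y in nearest) / k


# The five (x, y) points from the example: A, B, C, D, E
POINTS = [(2, 1), (4, 3), (5, 2), (7, 2), (9, 3)]

print(knn_predict(1, POINTS, k=1))           # -> 1.0 (nearest is A)
print(knn_predict(1, POINTS, k=2))           # -> 2.0 (mean of A and B)
print(knn_predict(10_000_000, POINTS, k=1))  # -> 3.0 (still E: no extrapolation)
print(knn_predict(1, POINTS, k=5))           # -> 2.2 (K=N: mean of every y)
```

Note there's no training step at all here; "learning" is just keeping `POINTS` around, and all the work happens at query time.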
- Now, KNN is so simple it's too dumb to be used that often; KERNEL REGRESSION, on the other hand, is a version of KNN that weighs points differently using a "kernel" (a function that decides how different points/neighbors should be weighted)
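- To make the "kernel" idea concrete, here's a minimal sketch, assuming a Gaussian kernel and a made-up `bandwidth` parameter (the lecture didn't name a specific kernel). Every stored point votes, but closer points get more weight.

```python
import math


def kernel_predict(x_query, points, bandwidth=1.0):
    """Weighted average of all y values, weighted by a Gaussian kernel on x-distance."""
    sq_dists = [(x - x_query) ** 2 for x, _ in points]
    # Shift by the smallest squared distance so the nearest point always has
    # weight 1. The shift cancels out in the weighted average, but it avoids
    # every weight underflowing to 0 when the query is far from all the data.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2 * bandwidth ** 2)) for d in sq_dists]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, points))
    return weighted_sum / sum(weights)


# Same five (x, y) points as the KNN example: A, B, C, D, E
POINTS = [(2, 1), (4, 3), (5, 2), (7, 2), (9, 3)]

print(kernel_predict(6, POINTS, bandwidth=1.0))   # smooth blend of nearby points
print(kernel_predict(2, POINTS, bandwidth=0.01))  # tiny bandwidth: acts like K=1
print(kernel_predict(2, POINTS, bandwidth=1e6))   # huge bandwidth: acts like K=N
```

The bandwidth plays the same role K did: too small and the output is noisy, too large and it flattens into the overall mean.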
- Another common learning algorithm is LINEAR REGRESSION
- Usually, when we talk about this, we're talking about multivariate regression, like so:
y = m0 + m1 * x1 + m2 * x2 + m3 * x3 + ...
- Here, the learning task is simple: we're trying to find a hyperplane (a flat surface with one fewer dimension than the space it sits in, e.g. a line in 2D or a plane in 3D) that best fits the data we have
- By "best fit," we mean that it minimizes our loss function (i.e. how different our predictions are from reality)
- So, the parameters we're trying to learn here are m0/m1/m2/etc.
- This is MUCH slower than training KNN (which doesn't even need training), since we have to fit the parameters first (e.g. with gradient descent)
- On the other hand, querying linear regression is much faster than KNN, since we don't need to calculate distances to all the points; we just need to plug in all the Xs!
- So, KNN might be better if we're adding new data constantly but querying infrequently
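- A minimal sketch of fitting y = m0 + m1 * x1 + m2 * x2 + ... with gradient descent on mean squared error. The learning rate, step count, and toy data (generated from y = 1 + 2*x1 + 3*x2) are all made up for illustration, not values from the lecture.

```python
def predict(m, x):
    """Query: plug the features into y = m[0] + m[1]*x1 + m[2]*x2 + ..."""
    return m[0] + sum(mj * xj for mj, xj in zip(m[1:], x))


def fit_linear(xs, ys, lr=0.1, steps=5000):
    """Train: learn m by repeatedly stepping down the gradient of the MSE loss."""
    n, d = len(xs), len(xs[0])
    m = [0.0] * (d + 1)  # m[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for x, y in zip(xs, ys):
            err = predict(m, x) - y
            grad[0] += 2 * err / n             # d(MSE)/d(m0)
            for j in range(d):
                grad[j + 1] += 2 * err * x[j] / n  # d(MSE)/d(mj)
        m = [mj - lr * gj for mj, gj in zip(m, grad)]
    return m


# Toy data generated from y = 1 + 2*x1 + 3*x2 (hypothetical)
XS = [[0, 0], [1, 0], [0, 1], [1, 1]]
YS = [1, 3, 4, 6]

m = fit_linear(XS, YS)
print([round(v, 3) for v in m])  # should land close to [1, 2, 3]
print(predict(m, [2, 2]))        # a query is just a few multiplies and adds
```

Training loops over the whole dataset thousands of times, but `predict` never touches the training data again, which is the speed trade-off against KNN described above.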
- Alright: what do you need to know for project 3?
- First off, how do we get these (x,y) training tuples from our stock data?
- Most learning algorithms have a poor conception of time/predicting the future/etc; there are time-sensitive learners where the order we send data matters, but they're not the norm
- So, how do we do this in terms of "when I see X, predict Y"? Well, we'll need to do some sort of offsetting!
- Let's suppose we have the following stock data for AAPL:
Price/SMA20 | BBRatio | Price
------------+---------+------
0.94        | 0.41    | 201
0.96        | 0.44    | 204
0.99        | 0.48    | 209
1.02        | 0.52    | 206
- "Price/SMA20" means the price divided by the simple moving average (SMA) of the price over the past 20 days
- "BBRatio" means "Bollinger Band Ratio," which we'll talk about more later (it basically means how far above the recent price we are, with 0.5 in the middle)
- So, we want to predict tomorrow's price using today's data; what should we do?
- Well, we'll offset the data so the current day's X features use TOMORROW's price as the Y!
- The issue with this is that we'll lose some of our data: the last day has no "tomorrow" price to use as its Y, so a 1-day offset costs us 1 row
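- A quick sketch of the offsetting trick, using the four rows from the table above. The function name `make_training_pairs` and the `offset` parameter are hypothetical, just for illustration.

```python
# (Price/SMA20, BBRatio, Price) for four consecutive days, from the table
ROWS = [
    (0.94, 0.41, 201),
    (0.96, 0.44, 204),
    (0.99, 0.48, 209),
    (1.02, 0.52, 206),
]


def make_training_pairs(rows, offset=1):
    """Pair each day's features with the price `offset` days later."""
    usable = len(rows) - offset  # the last `offset` days have no future price yet
    X = [(sma_ratio, bb_ratio) for sma_ratio, bb_ratio, _ in rows[:usable]]
    y = [price for _, _, price in rows[offset:]]
    return X, y


X, y = make_training_pairs(ROWS)
for features, target in zip(X, y):
    print(features, "->", target)
```

Four days of data give only three (X, Y) pairs here: the final day's features have nothing to be paired with.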
- Next, how do we know if we actually learned something?
- In finance, we do this through BACKTESTING!
- We first create our nice, shiny model
- Then, we start at a date in the past (which we pretend is the current day)
- We'll then give the model data from before its "current" day...
- ...and have it make predictions about the "current" day, and compare it to the actual data!
- Be VERY careful here; make sure your bot doesn't accidentally see the future!
- We'll then step forward one day, give it the next day's data, and repeat
- This'll give us the trades our algorithm would've done at that time!
- Once we have that list of trades, we'll have a "trading simulator" that can take in our start/end dates and list of trades and give us statistics about our portfolio (returns, Sharpe ratio, etc.), to see how we did
- Our 5th project in this class is basically that: given a list of trades, evaluate that strategy on a day-by-day basis
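- The day-by-day loop above can be sketched like this. This is only a skeleton under stated assumptions: `backtest` and `last_price_model` are made-up names, the "model" just predicts that tomorrow's price equals the latest one, and it records predictions rather than actual trades.

```python
def last_price_model(history):
    """Hypothetical stand-in model: predict that the price stays where it is."""
    return history[-1]


def backtest(prices, model, start=1):
    """Walk forward one day at a time, only ever showing the model the past."""
    results = []
    for today in range(start, len(prices)):
        history = prices[:today]  # strictly before "today": no peeking at the future
        predicted = model(history)
        results.append((today, predicted, prices[today]))
    return results


PRICES = [201, 204, 209, 206]  # the Price column from the earlier table

for day, predicted, actual in backtest(PRICES, last_price_model):
    print(f"day {day}: predicted {predicted}, actual {actual}")
```

The slice `prices[:today]` is the important line: if it were `prices[:today + 1]`, the model would see the very value it's supposed to predict.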
- So, backtesting tells us how well our trading strategy worked, but if our strategy wasn't good then we don't know WHY!
- It could be that our ML algorithm stinks; it could also be that we have a very good learning algorithm that's been trained on a really bad dataset/training strategy
- So, to assess our learner itself (NOT its financial strategy), we'll use a loss function
- In this class, we'll use RMSE ("root mean square error"), which is basically what it sounds like: square each error, take the mean over all N predictions, then take the square root (which puts the result back in the same units as the original error):
RMSE = sqrt( sum((y_act - y_pred)^2) / N )
- Again here, why do we use loss functions instead of just how many times we were right/wrong?
- One big reason is that we can weight different errors appropriately; if we're making a cancer detector, false negatives are a LOT worse than false positives, and we want our agent to learn that!
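- RMSE is only a couple of lines in practice. A small sketch, with made-up actual/predicted prices for illustration:

```python
import math


def rmse(y_actual, y_pred):
    """Root mean square error: sqrt of the mean of the squared errors."""
    n = len(y_actual)
    squared_errors = [(a - p) ** 2 for a, p in zip(y_actual, y_pred)]
    return math.sqrt(sum(squared_errors) / n)


# Hypothetical actual vs. predicted prices
Y_ACTUAL = [204, 209, 206]
Y_PRED = [201, 204, 209]

# Errors are 3, 5, -3, so this is sqrt((9 + 25 + 9) / 3)
print(rmse(Y_ACTUAL, Y_PRED))
```

Because the errors are squared before averaging, one big miss hurts more than several small ones, and the sign of the error doesn't matter.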
- To wrap up today, there are 2 big types of error in machine learning (and especially in finance)
- IN-SAMPLE error is the error we get when we're querying on our training data
- This lets us check that our learner is actually learning properly; if it can't work on the data it already saw, then we're probably using the wrong type of learner
- OUT-OF-SAMPLE error is when we're testing on new, unseen data
- This is what we actually care about!
- How we assess these two varies depending on the type of learner
- If we're working with a batch/non-iterative learner, we'll just split our data into "training" and "testing" groups (often an 80%/20% split after shuffling the data)
- It turns out the wine data that's commonly used for teaching has its wine rankings in chronological order - and, sure enough, the more drunk people got, the higher their ratings got!
- Shuffling is NOT a good idea if your data is time-dependent, so you
- For iterative learners, we need to choose other pieces of data to hold out (?)
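- A minimal sketch of the 80%/20% split described above. The shuffled version follows the notes; the chronological version is an assumption about a common alternative for time-dependent data (train on the earlier part, test on the later part), not something stated in the lecture. Function names and the `seed` parameter are made up.

```python
import random


def split_shuffled(data, train_frac=0.8, seed=0):
    """Shuffle a copy of the data, then cut it into train/test groups."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def split_chronological(data, train_frac=0.8):
    """Assumed alternative for time-ordered data: earlier rows train, later rows test."""
    cut = int(len(data) * train_frac)
    return list(data[:cut]), list(data[cut:])


DAYS = list(range(10))  # stand-in for 10 time-ordered samples

train, test = split_shuffled(DAYS)
print("shuffled:     ", train, test)

train, test = split_chronological(DAYS)
print("chronological:", train, test)
```

In-sample error would then be measured by querying the learner on `train`, and out-of-sample error by querying it on `test`.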
- Alright; we'll talk more about your Project 3 on Thursday!