//****************************************************************************//
//********* Machine Learning Basics (cont.) - September 10th, 2019 **********//
//**************************************************************************//

- Right; I think things went well, but I can't count my chickens before they hatch.
- So, today we're hoping to finish our lightning introduction to ML, and then we'll get started on what you need to know for Project 3 and assessing learners
    - We'll talk about types of learning, and then about "ensemble learners"
    - On Thursday, we'll then do a deep-dive tutorial on everything you need
- Next Tuesday, Professor Byrd will be away visiting J.P. Morgan, so there'll be a guest lecture from a GT postdoc (who's actually taught ML here at Georgia Tech before)
    - Professor Isbell tends to teach ML via narrative, while Brian (the guest lecturer) takes a more formal, mathematical approach
- Project 1 should finish being graded tonight; "if they don't, I'll get increasingly agitated at them"
    - There were sadly several cases of cheating caught; luckily, we have a TA whose sole job now is to deal with that
    - If you don't like the grade you got, you have a week to make a regrade request
- Also: the class after us has an exam today, so we need to get out of Dodge FAST after this!

--------------------------------------------------------------------------------

So, last time we were talking about generative vs. discriminative learners
- Basically, discriminative learners are just chopping the input space into regions and saying "this region means 'Cat image', that region means 'Dog'," etc.
- This means we can perform better on some complex problems, but we can't sample from it, since we don't know how "dense" each region is/how likely a given input is
- Generative learners, on the other hand, try to learn the entire distribution
    - This is great since we can sample from it and generate reasonable models with very small amounts of data, but it takes SIGNIFICANTLY longer to train for complex problems
- Another distinction between ML systems is the type of learning they do
    - BATCH learners need to train on all the data at once; if we add new data, we need to start all over again
        - They are *usually* more accurate, but you need to retrain all in one go
        - Polynomial regression is an example of this
    - ONLINE learners, though, can train on successive samples as they arrive without starting over
        - This is necessary for a LOT of applications
        - KNN is an example of an online learner
- Alright, those are ways of classifying learning algorithms, but what IS learning?
    - In an ML context, LEARNING is just "function approximation from data"
    - This isn't *quite* true for unsupervised learning, but it's close for most other types
        - For unsupervised learning, there's no target function to approximate; we're instead trying to find structure in the data
- Now, we mentioned last time that parameters are the stuff our algorithm can tweak, while hyperparameters are things we need to provide that control the learner
    - "Hyperparameter search" is usually trying to search for the hyperparameters that get us the best model
- ...okay, let's talk about our first learning algorithm example: KNN!
- K-Nearest Neighbors is an instance learner that stores all the data points it's given
    - "Running an example of this is a pretty likely exam question"
- So, let's say we give the algorithm 5 data points:
    - A (2,1)
    - B (4,3)
    - C (5,2)
    - D (7,2)
    - E (9,3)
- Given that, let's suppose K=1, and we want to estimate the Y-value at X=1
    - The closest point is "A", and the Y-value of "A" is 1, so we'd return 1!
- If K=2, then we'd take the two closest points on the X-axis (A/B), average their Y-values ((1+3)/2 = 2), and return that
- What if we need just 1 neighbor, and there are 2+ options that're an equal distance away? That's implementation-specific!
    - If it's a regression problem, we'll usually take the mean of the two options and use that
    - If it's a classification problem, we can't exactly take the average of "dog" and "cat," so we'll usually just pick the first point, or make a random choice, etc.
    - For classification problems, we'll also take the MODE instead of the mean, since we can't average different categories
- Because the output is based on the closest neighbors, the output function for this looks like a step function, which jumps when there's a new closest point/2nd-closest point/etc.
    - If K=2, then it'd jump when the farthest current neighbor gets farther away than the next non-neighbor
    - If K=N, we're just going to get a horizontal line for everything, so K is really important
        - Too low, and the step function will be really noisy
        - Too high, and we'll lose too much detail
- The KEY WEAKNESS of KNN, because of this, is that it can't extrapolate! If we query "x = 10000000", it'll still use "E" as the closest neighbor!
    - So, it can't detect trends outside of the existing data; but it IS good at interpolating noisy data
- Now, KNN is so simple it's too dumb to be used that often; KERNEL REGRESSION, on the other hand, is a version of KNN that weighs points differently using a "kernel" (a function that decides how different points/neighbors should be weighted)
- Another common learning algorithm is LINEAR REGRESSION
- Usually, when we talk about this, we're talking about multivariate regression, like so:

    y = m0 + m1*x1 + m2*x2 + m3*x3 + ...
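The KNN lookups walked through above (K=1 and K=2 on points A through E) can be sketched in plain Python; this is a minimal 1-D version, with the data and queries taken from the example:

```python
# Minimal 1-D KNN regression sketch, using the five example points above.
data = [(2, 1), (4, 3), (5, 2), (7, 2), (9, 3)]  # (x, y) for points A..E

def knn_predict(query_x, k, points):
    """Average the y-values of the k points closest to query_x."""
    # Sort the stored points by distance from the query along the x-axis
    nearest = sorted(points, key=lambda p: abs(p[0] - query_x))[:k]
    return sum(y for _, y in nearest) / k

print(knn_predict(1, 1, data))           # K=1: closest point is A, so 1.0
print(knn_predict(1, 2, data))           # K=2: average of A and B -> 2.0
print(knn_predict(10_000_000, 1, data))  # can't extrapolate: still E's y, 3.0
```

The last query shows the key weakness from the notes: no matter how far outside the data we go, the prediction is clamped to the nearest stored point.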
- Here, the learning task is simple: we're trying to find a hyperplane (a flat surface in N-1 dimensional space) that best fits the data we have
    - By "best fit," we mean that it minimizes our loss function (i.e., how different our predictions are from reality)
    - So, the parameters we're trying to learn here are m0/m1/m2/etc.
- This is MUCH slower to train than KNN (which doesn't even need training), since we need gradient descent to learn the parameters
    - On the other hand, querying linear regression is much faster than KNN, since we don't need to calculate distances to all the points; we just need to plug in all the Xs!
    - So, KNN might be better if we're adding new data constantly but querying infrequently
- Alright: what do you need to know for Project 3?
- First off, how do we get these (x,y) training tuples from our stock data?
    - Most learning algorithms have a poor conception of time/predicting the future/etc.; there are time-sensitive learners where the order we send data matters, but they're not the norm
    - So, how do we do this in terms of "when I see X, predict Y"? Well, we'll need to do some sort of offsetting!
- Let's suppose we have the following stock data for AAPL:

    Price/SMA20 | BBRatio | Price
    0.94        | 0.41    | 201
    0.96        | 0.44    | 204
    0.99        | 0.48    | 209
    1.02        | 0.52    | 206

    - "Price/SMA20" means the price divided by the 20-day simple moving average (the average price over the past 20 days)
    - "BBRatio" means "Bollinger Band Ratio," which we'll talk about more later (it basically measures how far above or below the recent average the price is, with 0.5 in the middle)
- So, we want to predict tomorrow's price using today's data; what should we do?
    - Well, we'll offset the data so the current day's X features use TOMORROW's price as the Y!
    - The issue with this is that we'll lose some of our data (how much???)
- Next, how do we know if we actually learned something?
    - In finance, we do this through BACKTESTING!
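The offsetting step above can be sketched like this, using the hypothetical AAPL rows from the table; shifting the price column back by one day makes each day's features point at tomorrow's price, and the final day (which has no "tomorrow") is the data we lose:

```python
# Turn daily (features, price) rows into (X, Y) training pairs by offsetting:
# day t's features get day t+1's price as the label.
features = [[0.94, 0.41], [0.96, 0.44], [0.99, 0.48], [1.02, 0.52]]  # [Price/SMA20, BBRatio]
prices   = [201, 204, 209, 206]

X = features[:-1]  # every day except the last (it has no "tomorrow" to predict)
Y = prices[1:]     # the price column shifted back by one day

for x, y in zip(X, Y):
    print(x, "->", y)
# We lose exactly one row here; predicting N days ahead would lose the last N rows.
```

That last comment answers the "how much???" question from the notes: one row per day of lookahead.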
- We first create our nice, shiny model
- Then, we start at a date in the past (which we pretend is the current day)
- We'll then give the model data from before its "current" day...
- ...and have it make predictions about the "current" day, and compare them to the actual data!
    - Be VERY careful here; make sure your bot doesn't accidentally see the future!
- We'll then step forward one day, give it the next day's data, and repeat
- This'll give us the trades our algorithm would've made at that time!
- Once we have that list of trades, we'll have a "trading simulator" that can take in our start/end dates and list of trades and give us statistics about our portfolio (returns, Sharpe ratio, etc.) to see how we did
    - Our 5th project in this class is basically that: given a list of trades, evaluate that strategy on a day-by-day basis
- So, backtesting tells us how well our trading strategy worked, but if our strategy wasn't good, then we don't know WHY!
    - It could be that our ML algorithm stinks; it could also be that we have a very good learning algorithm that's been trained on a really bad dataset/training strategy
- So, to assess our learner itself (NOT its financial strategy), we'll use a loss function
- In this class, we'll use RMSE ("root mean square error"), which is basically what it sounds like: the square root of the sum of all the squared errors divided by N (to keep the scale close to the original error):

    RMSE = sqrt(sum((y_act - y_pred)^2) / N)

- Again here, why do we use loss functions instead of just counting how many times we were right/wrong?
    - One big reason is that we can weight different errors appropriately; if we're making a cancer detector, false negatives are a LOT worse than false positives, and we want our agent to learn that!
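The RMSE formula above translates directly to code; here's a minimal sketch with made-up actual/predicted prices (the numbers are just illustrative):

```python
import math

def rmse(y_actual, y_predicted):
    """Root mean squared error: sqrt(sum((y_act - y_pred)^2) / N)."""
    n = len(y_actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted)) / n)

# Hypothetical actual vs. predicted prices:
actual    = [204, 209, 206]
predicted = [205, 207, 206]
print(rmse(actual, predicted))  # sqrt((1 + 4 + 0) / 3) ~= 1.29
```

Because the errors are squared before averaging, large misses are punished much more than small ones, which is one way a loss function encodes "some errors matter more."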
- To wrap up today, there are 2 big types of error in machine learning (and especially in finance)
- IN-SAMPLE error is the error we get when we're querying on our training data
    - This lets us check that our learner is actually learning properly; if it can't work on the data it already saw, then we're probably using the wrong type of learner
- OUT-OF-SAMPLE error is when we're testing on new, unseen data
    - This is what we actually care about!
- How we assess these two varies depending on the type of learner
    - If we're working with a batch/non-iterative learner, we'll just split our data into "training" and "testing" groups (often an 80%/20% split after shuffling the data)
        - It turns out the wine data that's commonly used for teaching has its wine rankings in chronological order - and, sure enough, the more drunk people got, the higher their ratings got!
        - Shuffling is NOT a good idea if your data is time-dependent, though, so be careful
    - For iterative learners, we need to choose other pieces of data to hold out (?)
- Alright; we'll talk more about Project 3 on Thursday!
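The shuffled 80%/20% split mentioned above can be sketched like this (for non-time-dependent data only; the function name and the seeded shuffle are my own choices, not from the lecture):

```python
import random

def train_test_split(rows, train_frac=0.8, seed=0):
    """Shuffle the rows, then split them into train/test groups."""
    rows = list(rows)                  # copy so we don't shuffle the caller's list
    random.Random(seed).shuffle(rows)  # seeded shuffle, for reproducibility
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

data = list(range(100))  # stand-in for 100 (features, label) rows
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
# NOTE: for time-series data, don't shuffle; hold out the most recent block instead.
```

Shuffling before splitting is what protects you from the wine-dataset problem above, where a chronological split would put all the generous late ratings into one group.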