//****************************************************************************//
//********************** Local Search - July 17th, 2018 *********************//
//****************************************************************************//

- Once more, at 2:35pm, we see no trace of Professor David Byrd
- 2:39 - ...still nothing save silence
    - (...well, as silent as a room full of college students can be)
- Ah! Rather quickly marked up on the board, observe:
    "Hill Climbing
     Simulated Annealing
     Genetic Algorithms
     Gradient Descent"
- "The TAs did finish grading your homeworks, but I forgot to press the big green button that lets you see your grades. Sorry. I'll hit that later tonight"
- Your last piece of homework is due Sunday - DON'T FORGET!
    - Basically, parts I-III deal with Value Iteration, and parts IV-VII deal with Q-learning
---------------------------------------------------
- So, last week we went over the Q-function approach to solving MDPs, where we only needed to know what state we're in and what actions we can take in order to form a policy
- There are ALSO POMDPs - "Partially Observable Markov Decision Processes" - where we don't even know for sure which state we're in, just the list of available actions
    - We won't talk about those in this class, but it's similar to dealing with our noisy-sensor Pacman problem, where we have to maintain a probability distribution over which state we could be in given our observations
    - "Apologies for not going over any of that this semester, but summer is short and some things just have to give"
- NOW, we're getting into a different topic: neural networks and local search!
    - Basically, neural networks are too big to exhaustively search, so we instead have to settle for searching a small portion of the space - "local search"
    - By FAR, gradient descent is the most popular way of doing this
- In a lot of our previous problems, we assumed that we had a discrete space that was small enough for us to remember all the paths we'd explored and to search the space in a reasonable amount of time...obviously, for real-world problems, that's not always true
    - In Search problems, for instance, we assumed we had a fully observable, discrete world of a "reasonably" searchable size
        - This let us calculate the globally optimal path...
    - But what if we have a HUGE - even potentially infinite - state space? What if our hardware has severe memory constraints, or we have to deal with a continuous state space?
    - In all of these cases, search quickly becomes either impossible or unrealistically time-consuming
- So, for those sorts of problems, we usually can't do an exhaustive search to guarantee the globally optimal solution - what we CAN do is find a locally optimal, "acceptable" solution; this is known as LOCAL SEARCH
- So, for what kinds of problems are non-optimal solutions okay? Well, a bunch of things:
    - NP-complete problems ("We can't calculate the full solution anyway, so bring on the heuristics!")
    - Optimization problems (ESPECIALLY those with continuous parameters)
- Alright, this is useful, but can we find these local solutions even given memory constraints far smaller than our state space?
    - ...well, YES! That's exactly what local search as a field is supposed to do!
- In order to solve a local search problem, we need our current state, the successor states, and the cost/objective function we're trying to optimize
- The most basic version of this is the HILL-CLIMBING ALGORITHM - just "move to the successor with the highest performance metric until we can't move anywhere higher"
- As an example of this, let's say we have the following Pacman grid of values (reading positions as (row, column) from the top-left):

        1  2  9  8
        3  1  2 12
        4  5  6 10
        8  7  6  5

    - If we start at (1,1), we'll end up at "8" and then stop
    - If we start at (1,2), though, it'll climb to the 9 and stop there
    - So, the issue with this algorithm is that we can end up trapped in local optima, when better solutions (like the 12) are just around the corner
- The greatest strength and weakness of this algorithm is that it keeps NO memory of where it's been - that means it requires practically no memory to run, and can run even in infinite problem spaces, but it's doomed to not be able to "escape" when it gets stuck
- Let's now look at a problem that's commonly solved as a CSP - the "Four Queens" problem
    - Let's say we have a 4x4 grid with 4 queens randomly placed on the board - can we move the queens to a position where none of them are attacking one another?
    - With a small grid like this, we can solve this pretty easily with a CSP - but what if we have a million queens on an NxN board?
        - That's too big of a state space to search!
- With local search, though, we can try to solve this in a less optimal, more efficient way
    - Let's say we have a cost function equal to the # of queens attacking one another, and we want to minimize it
    - For simplicity, let's say we can only move 1 queen at a time, and that we can only move the queens vertically (i.e. within their own column, since if 2 queens shared the same column they'd be attacking)
    - So, given that, we can explore all possible successor board states, move to the successor with the smallest cost, and repeat
    - This'll work! And we did it without global search!
- ...but there are issues. What if we start in a local minimum and there are no successors with a smaller cost?
    - If we allow ourselves to move to states with a cost equal to the current state's ("lateral moves"), there's potential for an infinite loop (we could just hop back-and-forth between the current state and another one), but also a chance we might find a way out
    - To avoid the infinite loop, we can just put a limit on how many lateral steps we take - that's about as good as we can get
- So, there are some major problems with a simple hill-climbing algorithm - running this algorithm on the 8-queens problem, 86% of runs get stuck in a local optimum!
    - The BIG advantage of this algorithm, though, is its speed - on average, it finds a local optimum within 3-4 steps
- So, one improvement we can make to this algorithm is RANDOM RESTARTS - if we finish without reaching a globally optimal solution, we reset to a random initial state and begin climbing again (sketched in code below)
    - We ALWAYS need a max number of "restarts," however, since otherwise there's a chance this algorithm will loop forever
    - Now, for the 8-queens problem, an average of 7 restarts is enough to find a 0-cost global solution - just 22 steps total on average! That's pretty good!
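- To make that concrete, here's a rough Python sketch (mine, not from lecture) of hill climbing with random restarts on the n-queens setup above; the names num_attacking_pairs, hill_climb, and random_restart_hill_climb are made up for illustration, and moves are limited to shifting one queen within its column, as described:

    import random

    def num_attacking_pairs(cols):
        """Cost function: number of pairs of queens attacking each other.
        cols[i] is the row of the queen in column i (one queen per column)."""
        n = len(cols)
        attacks = 0
        for i in range(n):
            for j in range(i + 1, n):
                if cols[i] == cols[j] or abs(cols[i] - cols[j]) == j - i:
                    attacks += 1
        return attacks

    def hill_climb(cols):
        """Greedily move one queen within its column to the best successor state;
        stop when no successor is strictly better (a local optimum)."""
        cols = list(cols)
        while True:
            current_cost = num_attacking_pairs(cols)
            best_cost, best_move = current_cost, None
            for col in range(len(cols)):
                original_row = cols[col]
                for row in range(len(cols)):
                    if row == original_row:
                        continue
                    cols[col] = row
                    cost = num_attacking_pairs(cols)
                    if cost < best_cost:
                        best_cost, best_move = cost, (col, row)
                cols[col] = original_row
            if best_move is None:          # stuck: local optimum (maybe not global)
                return cols, current_cost
            cols[best_move[0]] = best_move[1]

    def random_restart_hill_climb(n, max_restarts=50):
        """Re-run hill climbing from random initial boards until cost 0 or we give up."""
        best_state, best_cost = None, float("inf")
        for _ in range(max_restarts):
            state, cost = hill_climb([random.randrange(n) for _ in range(n)])
            if cost < best_cost:
                best_state, best_cost = state, cost
            if best_cost == 0:
                break
        return best_state, best_cost

    # e.g. random_restart_hill_climb(8) usually finds a 0-cost 8-queens board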
- "This simple algorithm is actually used more than you'd think" - An alternative to this is "beam search," where you just remember the past 3 or 4 states you were in and backtrack if you get stuck - we won't deal with it in this class, but it's definitely a thing - Now, what if we have a max-optimization problem and we flip it around so the "best" solutions are the minima? - Now, our Hill-Climbing optimization is like putting a marble on the top of the hill and letting it roll down - and what do you do in the real world if the marble gets stuck somewhere? You shake it! - "And shaking the marble is exactly what we're gonna be doing!...in a mathematical sense" - So, when we fall into a local minimum the first time, we "shake" the state a LOT at first to see if it falls back into the state or goes somewhere else; over time, we shake more gently so that we don't knock our state out of our final answer - This algorithm is called SIMULATED ANNEALING, and because "annealing" is a metallurgical term, the vocab makes no sense: we say "Temperature" is how hard we shake the state - Formally, the algorithm can be described as this: 1) T = some starting temperature, S0 = some random initial state 2) Select some random neighboring state and calculate "delta = cost(next) - cost(current)" 3a) If delta < 0, we're moving downhill, so current = next! 3b) If delta >= 0, then if random(0, 1) < e^(-delta/T), we STILL take it - So, the bigger delta is, the less likely we are to take this "bad" exploration move, and the higher the temperature, the more likely it is 4) At some problem-appropriate point, reduce T on some fixed schedule (e.g. reduce T by 1 every step, reduce T by 0.1 every 10 steps, etc.) 5) Loop back to 2 until T <= 0, at which point we stop - "So, annealing is actually a pretty simple algorithm" - GENETIC ALGORITHMS, on the other hand, are really cool - but not actually used for neural networks today, so we'll only go over the basics - The basic idea of G.A. 
- GENETIC ALGORITHMS, on the other hand, are really cool - but not actually used for neural networks today, so we'll only go over the basics
- The basic idea of a G.A. is that it's a biology-inspired meta-algorithm, where we split our solution into "DNA strands" that we combine to produce "children" that're better than their parents
    - So, we have an objective function to evaluate each candidate, called the "fitness function"
- So, going back to our 8-queens problem, we could have a string of 8 column positions (one for each queen) that UNIQUELY and DISCRETELY maps our state to a 1-dimensional array
    - This'll be our "DNA strand" representation
    - Another example would be stock trading, where we could have a 1-letter code for each graph pattern and 1 array entry for each "pattern" over the past hour
- With G.A., we'll start off with randomly generated ancestor array "strands" (often they're literally strings); we'll then evaluate the fitness of those strands, and select the best "k" individual strands to "breed"
    - "Technically this is eugenics, but I think it's okay since it's just math"
    - k, of course, is a programmer-defined hyperparameter

        Cost = 3, 1|4|2|4
        Cost = 1, 3|2|4|1

- We then generate "children" strands by "mutating" each strand (randomly changing one array entry to another valid entry) and "crossing over" substrings at a random location
    - The distribution should favor taking more material from the higher-fitness parent

        Cost = 3, 1|4|2|4        Cost = 3, 1|4|2|1
                            =>
        Cost = 1, 3|2|4|1        Cost = 1, 3|2|4|4

- We continue to do this until the fitness stops improving (see the sketch below)
- Now, this is technically a meta-algorithm, so it can be used in tandem with existing algorithms to improve them - the main drawback is that you have to be able to map your state into a discrete string (which doesn't work well with infinite state spaces, although you can kind of hack it sub-optimally by choosing subsets)
- These're still commonly used in engineering applications, but have been somewhat eclipsed by neural networks in recent years - "we're a bit faddish in CS"
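- As a rough sketch of the whole loop on n-queens (my own code, not from lecture): the names fitness, crossover, mutate, and genetic_queens are made up, selection is simplified to "keep the top k" rather than a fitness-weighted distribution, and the mutation rate and population size are arbitrary:

    import random

    def fitness(strand):
        """Number of NON-attacking queen pairs; higher is better (max = n*(n-1)/2)."""
        n = len(strand)
        attacking = sum(
            1
            for i in range(n)
            for j in range(i + 1, n)
            if strand[i] == strand[j] or abs(strand[i] - strand[j]) == j - i
        )
        return n * (n - 1) // 2 - attacking

    def crossover(parent_a, parent_b):
        """Splice two strands at a random point: prefix from one parent, suffix from the other."""
        point = random.randrange(1, len(parent_a))
        return parent_a[:point] + parent_b[point:]

    def mutate(strand, rate=0.1):
        """With some small probability, change one entry to another valid value."""
        strand = list(strand)
        if random.random() < rate:
            strand[random.randrange(len(strand))] = random.randrange(len(strand))
        return strand

    def genetic_queens(n=8, population_size=100, k=20, generations=500):
        """Each generation, keep the k fittest strands and breed them to refill the population."""
        population = [[random.randrange(n) for _ in range(n)] for _ in range(population_size)]
        best = max(population, key=fitness)
        for _ in range(generations):
            parents = sorted(population, key=fitness, reverse=True)[:k]   # selection
            if fitness(parents[0]) > fitness(best):
                best = parents[0]
            if fitness(best) == n * (n - 1) // 2:   # no queens attacking: done
                break
            population = [                           # breeding: crossover + mutation
                mutate(crossover(*random.sample(parents, 2)))
                for _ in range(population_size)
            ]
        return best, fitness(best)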
- Our final form of complicated, mind-melting local search is...linear regression!
- Remember: an "online" learner can accept new data points one at a time, while a batch learner needs all of its data at once and has to start over if the data changes
    - "Batch learners tend to produce better final results, and online learners are convenient for most applications - linear regression, barring some modifications, is a batch learner"
- So, the standard multivariate linear regression equation just looks like this:

        y = b + m1*x1 + m2*x2 + ... + mn*xn

    - "This is just defining whatever your dimensional equivalent of a 2D line / 3D plane / etc. is, letting us sample points from it"
- Given some (X, y) training tuples, where X = <x1, ..., xn>, we're trying to minimize some error function err(y_actual, y_estimated)
- What is the "learning," then? It's learning the optimal values of the "m"s (and "b")!
    - The "hyperparameter," then, is "n" - the number of dimensions
- Once trained, we can just plug a given "X" vector in to get an estimated "y" for that point
- So, how can we find at least a locally optimal solution to train this thing? GRADIENT DESCENT!
- Basically, we initialize all of our "m" values to some random values, look at our error function, figure out which direction to shift our values to reduce the error, and move in that direction - repeat until we stop improving (see the sketch at the end of these notes)
    - "For TRULY linear equations, we actually have a closed-form solution for this problem - for anything non-linear, of course, we have to approximate things"
- So, given some handy math libraries, we can honestly create a linear regression learner in just a few lines of code:

        class LinRegLearner:
            def __init__(self):
                pass

            def train(self, trainX, trainY):
                # So, trainX would be a 2D array with 1 row per example X vector,
                # and trainY would be the answer for each row
                # <missing part we'll go over on Thursday>
                self.m, self.b = someLibrary.linreg(trainX, trainY)

            def query(self, testX):
                return (self.m * testX) + self.b

- So, that's the easy part of gradient descent - but what's the hard part? We'll start digging into that on Thursday
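- Since we stopped before the details, here's a rough sketch (mine, not from lecture) of what a gradient descent training step for this learner could look like, assuming mean squared error as the error function; the function name, learning rate, and step count are all arbitrary choices:

    import numpy as np

    def gradient_descent_linreg(trainX, trainY, learning_rate=0.01, steps=1000):
        """Fit y = b + m.x by repeatedly nudging (m, b) opposite the gradient of the
        mean squared error - i.e. stepping in the 'downhill' direction of the error surface."""
        n_samples, n_features = trainX.shape
        m = np.random.randn(n_features)   # random initial slopes
        b = 0.0                           # initial intercept
        for _ in range(steps):
            predictions = trainX @ m + b
            error = predictions - trainY
            # partial derivatives of mean squared error with respect to m and b
            grad_m = (2.0 / n_samples) * (trainX.T @ error)
            grad_b = (2.0 / n_samples) * error.sum()
            m -= learning_rate * grad_m   # step downhill
            b -= learning_rate * grad_b
        return m, b

    # e.g., inside LinRegLearner.train():  self.m, self.b = gradient_descent_linreg(trainX, trainY)

- Choosing the learning rate and knowing when to stop are exactly the kind of "hard part" details left for Thursday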