//****************************************************************************//
//********************* Q-Learning - November 7th, 2019 *********************//
//**************************************************************************//

- Alright, we have 2 projects left!
    - The first is going to be building a Q-learning agent
        - This isn't a whole lot of code, but it's important to get the code that is there right, and it can be a little annoying to debug
    - The 2nd will be adapting your Q-learning agent so that it works for stock-trading

--------------------------------------------------------------------------------

- Alright, last time we started to talk about model-based and model-free reinforcement learning
    - As we said, this is similar to the difference between value iteration and policy iteration; model-based learning is where we're dropped into an unknown world, and we try to figure out T and R first to calculate a policy, then throw T and R out
        - Sometimes this is good, but if we're going to throw out T and R anyway, then that's wasted work! We didn't actually care about getting them!
        - This also involves a lot of experimentation where we're just randomly bumming around the world without getting better until we actually make our policy, which isn't great (particularly if the real-world version of "exploring" is stuff like your expensive robot driving off a cliff, a medical robot going stabby-stabby, etc.)
    - Model-free learning, then, tries to learn the policy directly as an "online" learner, meaning it gets better IMMEDIATELY
        - Our transition function, reward function, and utilities aren't particularly online - can we change that?
- To solve this, we'll use an online version of our utility function: the Q-function!
    - The Q-function Q(s,a) represents the total expected rewards (current and future) from:
        - Being in our current state "s"
        - Taking an action "a" in state s
        - Continuing to act according to the policy our Q-function implies
            - What do we mean by "implies?" We'll see...
- Using this, we can do Q-LEARNING: the most popular (though definitely NOT the only) model-free reinforcement learning algorithm in existence right now
    - For now, we'll assume S/A are discrete (although there are variants that avoid this), so that we can create a "Q-table" that gives a utility-like "Q-value" for each state/action pair
    - Then, at any given time, our policy is just to do the highest-scoring action for the current state we're in!
- How do we actually learn the Q-values? Here's the pseudocode version:

```
s = s_0
Q = all 0s or small random numbers for each state/action pair
    # Small random numbers are generally preferred, since all-0s can accidentally
    # encourage the agent to keep repeating whatever initial path it tried first,
    # taking longer to improve (how small depends on the size of your rewards; you
    # want these to be easily overwritten by your rewards, like 1000x smaller than
    # actual rewards)
while Q is still changing by >= some constant epsilon:
    a = argmax_a(Q[s, a])
    take action a; observe the resulting s' and reward r
    improve Q using the (s, a, s', r) tuple
    s = s'
    if s is a goal/terminal state:
        s = s_0  # start over
```

- That's not too bad! But there's one big black box here: how exactly do we "improve Q?"
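- Before we answer that, here's the loop above as a minimal runnable Python sketch. The gym-style `env` (with `reset()` returning a state and `step(a)` returning `(next_state, reward, done)`) is a hypothetical stand-in for whatever world we're dropped into, and `improve_q` is passed in as the black box we're about to define:

```python
import numpy as np

def q_learn(env, improve_q, num_states, num_actions,
            epsilon=1e-4, max_episodes=1000):
    """Tabular Q-learning loop; improve_q(Q, s, a, s_prime, r) is the
    "improve Q" black box, passed in so we can fill it in below."""
    # Start with small random Q-values, per the note in the pseudocode
    Q = np.random.uniform(-1e-4, 1e-4, size=(num_states, num_actions))

    for _ in range(max_episodes):
        old_Q = Q.copy()
        s = env.reset()                     # back to s_0
        done = False
        while not done:
            a = int(np.argmax(Q[s]))        # best-known action for state s
            s_prime, r, done = env.step(a)  # observe resulting state and reward
            improve_q(Q, s, a, s_prime, r)  # improve Q with the (s, a, s', r) tuple
            s = s_prime
        # epsilon here is the convergence threshold from the pseudocode above
        if np.abs(Q - old_Q).max() < epsilon:
            break
    return Q
```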
- "It's not *too* hard, but it's generally the most challenging part of Q-learning" - Basically, it looks like this: Q_new(s,a) = (1 - learning_rate) * Q(s,a) + learning_rate * (r + g*Q(s', argmax(Q[s',a']))) - Where "r" is the reward and "g" is the discount rate for future knowledge (between 0 and 1) - Essentially, we're taking the old value and updating it a little bit with our reward and the value of the best action we could take in the new state - "This should look VERY similar to the Bellman Update equation; over time, as we explore more and more of the world, these Q values will becomes less and less random and more and more meaningful, converging on the best strategy we're aware of" - So, that's the Q-learning process - are there any tweaks we can do to make it better? - One problem is that if we always do the "best action" we're aware of, then we might fall into a local minima and never explore if there's a better option somewhere out there - This is called the EXPLORATION-EXPLOITATION problem, and we can solve this by having our agent randomly take actions every now and then to explore the world - ...after a certain point, though (like going into production), we want our agent to stop doing stupid stuff and just act optimally, so always taking random actions isn't going to be great - "If our agent is playing around near a cliff, that's not the greatest time to check if there's something interesting at the bottom" - To solve this, we'll use the EPSILON-GREEDY algorithm: every X percent of the time, take some random NON-OPTIMAL action - Like the learning rate, we'll usually decrease this over time once we think we've got a pretty good idea of the world - "In your project, these are called `rar` - random action rate - and `radr` - random action decay rate" - So, Q-learning seems great! Why don't we use it for everything? - First off, Q-learning IS awesome - think about it! - Your learner doesn't need to know ANYTHING about the world; the same algorithm will work in any world we put it in, as long as it's aware of its state and its actions - We even have a proof that, given enough examples and a properly-structure problem, Q-learners - However ,there are some issues: - Classic Q-learning requires a discrete state/action space, which isn't practical for every problem AND brings in the "curse of dimensionality:" the more states we have, the less likely it is that we have data to learn each possible state - Stock trading is a good example of something that doesn't discretize easily - For many problems, it's hard to figure out WHY you got a reward; if you lose a chess game, your agent might have problems figuring out which move was its main mistake - Similarly, if we buy a stock position and don't make very much money, where did we go wrong? Did we buy on the wrong day? Was there an uncontrollable market swing? Did we buy too late? Should we have recognized that downturn? - This is known as the CREDIT ASSIGNMENT PROBLEM, and it's hard - Experimentation can also be VERY slow, which for worlds where we only have a limited number of actions in a given time period (e.g. only buying/selling once per day) - Last and probably biggest, experimentation can be VERY costly if we're in the real world - If we're "exploring" in a real stock agent, we'd be losing actual money by letting our agent "explore." If we're using Q-learning for a robot, letting it drive off a cliff for the sake of "exploration" probably wouldn't make our superiors very happy. 
- For things like stock trading, we could still learn by making a decision and then not executing it (just "pretending" we did), but there's an opportunity cost there: if we WOULD have made money, then we missed our chance!
- So, Q-learning is still pretty good, but there's that pesky problem that exploration takes a LONG time, and rewards take a long time to propagate from terminal states through the rest of our table
- To help deal with this, we can use something called DYNA-Q!
    - In modern terms, this is called "experience replay," but a researcher named Rich Sutton was talking about it all the way back in the 1990s
    - The idea here is that our agent will "hallucinate" many additional experiences after each real experience tuple it gets, letting new experiences spread through our Q-table more quickly
        - This is like our agent replaying stuff in its head to let the consequences sink in
        - "This is NOT faster computationally, but it does let our agent learn from fewer real experiences, which is often VERY useful"
    - The main drawback is that, to generate these hallucinations, we need to build up a model of the world again
    - Here's roughly how the algorithm works (a rough code sketch is at the bottom of these notes):
        - First, do regular Q-learning
        - Then, learn the world models T/R from our experiences (like an online version of model-based learning)
            - "Sutton's original suggestion is to keep a count of how many times we took each state/action pair and how many times each possible outcome state happened, and to compute T from that"
        - Finally, hallucinate a BUNCH of new experiences from this
            - To do this, we'll pick a random previous state we've seen, a random action we took in that state, and then choose an outcome state/reward probabilistically based on our model of the world
            - Then, run our Q-update using that new tuple, just like normal!
        - Keep doing this until we've got a good enough model, and we're done!
- Alright, see you next week!
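- (Postscript for reference: a minimal Python sketch of the Dyna-Q hallucination step described above. The count-based T model and running-average R follow Sutton's counting suggestion, but the dictionary layout, names, and the `improve_q` callback from the earlier sketch are illustrative assumptions, not the project's required design.)

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng()

# Learned world model, updated after every REAL (s, a, s', r) experience:
#   T_counts[(s, a)][s'] = how many times taking a in s has led to s'
#   R[(s, a)]            = running estimate of the reward for taking a in s
T_counts = defaultdict(lambda: defaultdict(int))
R = {}

def update_model(s, a, s_prime, r, alpha=0.2):
    """Count-based model learning, per Sutton's suggestion."""
    T_counts[(s, a)][s_prime] += 1
    R[(s, a)] = (1 - alpha) * R.get((s, a), r) + alpha * r

def hallucinate(Q, improve_q, n=200):
    """Replay n fake experiences drawn from the learned model, running the
    normal Q-update (improve_q from the earlier sketch) on each one."""
    seen = list(T_counts.keys())
    if not seen:
        return
    for _ in range(n):
        s, a = seen[rng.integers(len(seen))]     # random previously-seen (s, a)
        outcomes = T_counts[(s, a)]
        states = list(outcomes)
        probs = np.array([outcomes[sp] for sp in states], dtype=float)
        s_prime = int(rng.choice(states, p=probs / probs.sum()))  # sample s' from learned T
        improve_q(Q, s, a, s_prime, R[(s, a)])   # normal Q-update on the fake tuple
```

- In a full agent, `update_model` would run after every real (s, a, s', r) step, followed by a batch of `hallucinate` updates before taking the next real action.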