//****************************************************************************//
//****************** Q-Learning (cont.) - April 1st, 2019 ********************//
//****************************************************************************//
- As I'm sure you know, Exam 2 is THIS WEDNESDAY, i.e. 2 days from now - it's open note, covers everything we've done between Exam 1 and Exam 2 EXCEPT for Reinforcement Learning (Exam 1 started *just* before we finished doing decision-making)
- Homework 7 is done, and thus, homework 8 begins!
- The first half of the course wasn't very technically involved; this homework assignment isn't particularly more difficult than the others, but it IS more mathematically involved, and uses "Tabular Reinforcement Learning"
- For the 4731 people (the non-grad students), you'll be teaching your agent to play a grid-based game
- There'll be walls, cells that damage your robot, some "people" we need to collect, and an enemy that chases us around
- You'll train a "Controller" agent, which'll iterate for a specified number of iterations; it'll then go into "interactive mode," where you can play as the enemy robot and chase your robot around
- By default, the enemy either acts randomly or according to predefined rules
- Your job is to implement Q-learning in the controller, then fiddle with the agent parameters
- You will ALSO have to write a very brief report where you explain why the agent does certain things (e.g. why the agent often learns to walk through lava, etc.)
- For the grad students, then, you'll be training agents to play a simple level of a game called COIN RUN (created by OpenAI), trying to get your agent to beat it in under 50 timesteps
- Specifically, we'll be looking to make sure your code converges to the optimal solution (even if it takes ridiculously long to do so)
- For Professor Riedl's solution, the agent randomly does 10,000 actions to map out its possible options and rewards
- To make the problem easier, this version of the game has a box that changes color based on the agent's velocity, making it easier for the neural net to "learn" the agent's velocity
- The agent will start acting pretty decently after ~1 million timesteps, but that'll take FOREVER on a laptop
- To speed this up, we'll instead use Jupyter Notebooks on Google CoLab to run your code on some GPU-enabled computers in the cloud
- The Jupyter Notebook code is basically like a Python interpreter with markdown mixed in
- If you're okay with not seeing the graphics, you can run the entire thing on CoLab without having to ever run the files directly on your own computer
- On the Notebook itself, there'll be some boxes where you write your own code, just like for the other assignments
- Once the code is done, and the model has been trained to your satisfaction, you'll download the final trained model and submit it on Canvas
- So, this homework is more complicated because of the neural-net component, and will also be tested directly by Professor Riedl
--------------------------------------------------------------------------
- SO, to do these fantastic homework assignments, we need to actually learn a bit more about reinforcement learning!
- Last week, we were doing a type of RL called Q-LEARNING, which requires us to know the game's states, possible actions, and rewards - but assumes we DON'T have the transition function that tells us the rules (i.e. "what happens" for a given action) or what our opponent is doing
- From this, we can derive our UTILITY, or "Q" value: how much total score we expect to get for a given state/action pair (e.g. "If you're at the bottom of the pit, and you jump, you'll probably get a score of 57.3")
- To learn these values for a given state/action, we use the Q-UPDATE function, which updates our Q-table of Q values
- Basically, this function nudges the utility of doing an action at a certain time/place toward the current reward plus the likely future rewards from the new state; subtracting the current utility gives us the error we're correcting, and scaling that error by a learning rate makes sure we don't update things too quickly
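- Written out (using alpha for the learning rate and gamma for the discount on future rewards, so they aren't confused with the action a_t), the standard Q-update is:
      Q(s_t, a_t) = Q(s_t, a_t) + alpha * (r_{t+1} + gamma * max_a' Q(s_{t+1}, a') - Q(s_t, a_t))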
- So, that's what we do when we're updating our Q-value...ALMOST
- If we're at a terminal state (the goal, or death, or a YOU LOSE enemy), we shouldn't take future rewards into account!
- For these game-ending states, the update should instead be:
      Q(s_t, a_t) = Q(s_t, a_t) + alpha * (r_{t+1} - Q(s_t, a_t))
- "Technically, this isn't required for Homework 8 to work well (since there's no terminal state), but I wrote a unit test for it, so hey - you're doing it!"
- To train our agent, it has to play the game a lot of times.
- ...you'll notice I'm still talking, so it isn't quite that simple
- If we had our agent just take random actions, we'd EVENTUALLY see everything, but it'd take a VERY long time
- You basically need to wait until you hit a terminal state to start getting non-zero Q-values for stuff, and until then the agent has NO idea what it's doing
- So, if we're not done learning, just taking RANDOM actions every time is inefficient, and results in us having no guarantee of seeing states that are far away from our initial states
- ...which means the time random exploration needs grows exponentially with the size of the state space
- So, let's instead try acting ON-POLICY, where we ALWAYS perform the best action that we currently know about
- Initially, all of our Q-table values are 0, meaning our agent will take random actions
- Eventually, though, the agent will see the same state twice, updating its Q-value - and if it did something that got it a positive reward, it'll do the same action it took last time!
- This is GOOD because it gets us a positive reward, but it means that once we get just a little bit of reward, we'll stick to that path hyper-conservatively - but what if that path is only a local optimum? What if there's a HUGE reward just a little bit off the path we're on?
- So, random exploration is good because we get to see a lot of states, and on-policy lets us see the same states multiple times, letting us update their values repeatedly and refine our plan
- So, to combine these two, we'll use a hybrid plan called the EPSILON-GREEDY scheme (sketched here in Python; the epsilon value and variable names are just illustrative):
      epsilon = 0.1   # some number we choose between 0 and 1
      if random.random() < epsilon:
          action = random.choice(actions)                            # explore: take a random action
      else:
          action = max(actions, key=lambda a: q_table[(state, a)])   # exploit: best known on-policy action
- This scheme means we'll still zero-in on the best actions, but we'll keep exploring around a bit, too!
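- To tie it all together, here's a rough, self-contained Python sketch of tabular Q-learning with the epsilon-greedy scheme on a made-up toy environment (a tiny 1-D corridor) - the environment, hyperparameter values, and names are all illustrative, not the homework's actual setup:
      import random
      from collections import defaultdict

      class CorridorEnv:
          """Toy stand-in for a game: walk a 1-D corridor, +1 reward for reaching the far end."""
          def __init__(self, length=6):
              self.length = length
          def reset(self):
              self.pos = 0
              return self.pos
          def step(self, action):                      # action is -1 (left) or +1 (right)
              self.pos = min(max(self.pos + action, 0), self.length - 1)
              done = (self.pos == self.length - 1)     # the far end is the terminal state
              return self.pos, (1.0 if done else 0.0), done

      alpha, gamma, epsilon = 0.1, 0.9, 0.2            # learning rate, discount, exploration rate
      actions = [-1, +1]
      q_table = defaultdict(float)                     # (state, action) -> Q-value, all 0 to start

      def choose_action(state):
          """Epsilon-greedy: explore with probability epsilon, otherwise exploit (ties broken randomly)."""
          if random.random() < epsilon:
              return random.choice(actions)
          best = max(q_table[(state, a)] for a in actions)
          return random.choice([a for a in actions if q_table[(state, a)] == best])

      env = CorridorEnv()
      for episode in range(500):                       # the agent has to play the game a lot of times
          state, done = env.reset(), False
          while not done:
              action = choose_action(state)
              next_state, reward, done = env.step(action)
              target = reward if done else reward + gamma * max(q_table[(next_state, a)] for a in actions)
              q_table[(state, action)] += alpha * (target - q_table[(state, action)])
              state = next_state

      print({s: round(max(q_table[(s, a)] for a in actions), 2) for s in range(env.length)})
- After training, the printed values should climb as the state gets closer to the goal - exactly the "expected total score" intuition for Q-values from above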
- Alright, that should be enough to start doing tabular reinforcement learning - in the meantime, prepare for the exam on Wednesday! Good luck!