//****************************************************************************//
//****************** Q-Learning (cont.) - April 1st, 2019 ********************//
//**************************************************************************//

- As I'm sure you know, Exam 2 is THIS WEDNESDAY, i.e. 2 days from now - it's open note, covers everything we've done between Exam 1 and Exam 2 EXCEPT for Reinforcement Learning (Exam 1 started *just* before we finished doing decision-making)
- Homework 7 is done, and thus, homework 8 begins!
    - The first half of the course wasn't very technically involved; this homework assignment isn't particularly more difficult than the others, but it IS more mathematically involved, and uses "Tabular Reinforcement Learning"
    - For the 4731 people (the non-grad students), you'll be teaching your agent to play a grid-based game
        - There'll be walls, cells that damage your robot, some "people" we need to collect, and an enemy that chases us around
        - You'll train a "Controller" agent, which'll iterate for a specified number of iterations; it'll then go into "interactive mode," where you can play as the enemy robot and chase your robot around
            - By default, the enemy either acts randomly or according to predefined rules
        - Your job is to implement Q-learning in the controller, then fiddle with the agent parameters
            - You will ALSO have to write a very brief report where you explain why the agent does certain things (e.g. why the agent often learns to walk through lava, etc.)
    - For the grad students, meanwhile, you'll be training an agent to play a simple level of a game called COIN RUN (created by OpenAI), trying to get it to beat the level in under 50 timesteps
        - Specifically, we'll be looking to make sure your code converges to the optimal solution (even if it takes ridiculously long to do so)
        - In Professor Riedl's solution, the agent takes 10,000 random actions to map out its possible options and rewards
            - To make the problem easier, this version of the game has a box that changes color based on the agent's velocity, making it easier for the neural net to "learn" the agent's velocity
        - The agent will start acting pretty decently after ~1 million timesteps, but that'll take FOREVER on a laptop
            - To speed this up, we'll instead use Jupyter Notebooks on Google CoLab to run your code on some GPU-enabled computers in the cloud
                - The Jupyter Notebook code is basically like a Python interpreter with markdown mixed in
                - If you're okay with not seeing the graphics, you can run the entire thing on CoLab without having to ever run the files directly on your own computer
            - On the Notebook itself, there'll be some boxes where you write your own code, just like for the other assignments
        - Once the code is done and the model has been trained to your satisfaction, you'll download the final trained model and submit it on Canvas
        - So, this homework is more complicated because of the neural-net component, and will also be tested directly by Professor Riedl
--------------------------------------------------------------------------

- SO, to do these fantastic homework assignments, we need to actually learn a bit more about reinforcement learning!
    - Last week, we were doing a type of RL called Q-LEARNING, which requires us to know the game's states, possible actions, and rewards - but assumes we DON'T have the transition function, i.e. the rules that tell us "what happens" for a given action or what our opponent will do
    - From this, we can derive our UTILITY, or "Q" value: how much total score we expect to get for a given state/action pair (e.g. "If you're at the bottom of the pit, and you jump, you'll probably get a score of 57.3")
        - To learn these values for a given state/action, we use the Q-UPDATE function, which updates our Q-table of Q values
        - Basically, this function updates the utility for doing an action in a given state using the current reward and the likely future rewards from the new state (minus the current estimate, scaled by a learning rate so we don't update things too quickly) - written out in full right below
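
- Written out in the standard tabular Q-learning form (with alpha as the learning rate and gamma as the discount factor), that update rule is:

        Q(s_t, a_t) = Q(s_t, a_t) + alpha*(r_(t+1) + gamma*max_a Q(s_(t+1), a) - Q(s_t, a_t))

    - In words: move the old estimate a fraction (alpha) of the way toward "the reward we just got, plus the discounted value of the best action available in the new state"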

- So, that's what we do when we're updating our Q-value...ALMOST
    - If we're at a terminal state (the goal, or death, or a YOU LOSE enemy), we shouldn't take future rewards into account!
        - For these game-ending states, the update drops the future-reward term entirely (both cases are sketched in code after this list):

            Q(s_t, a_t) = Q(s_t, a_t) + alpha*(r_(t+1) - Q(s_t, a_t))

            - "Technically, this isn't required for Homework 8 to work well (since there's no terminal state), but I wrote a unit test for it, so hey - you're doing it!"
        
- To train our agent, it has to play the game a lot of times.
    - ...you'll notice I'm still talking, so it isn't quite that simple
    - If we have our agent just take random actions, we'll EVENTUALLY see everything, but it'll take a VERY long time
        - You basically need to wait until you hit a terminal state to start getting non-zero Q-values for stuff, and until then the agent has NO idea what it's doing
    - So, if we're not done learning, just taking RANDOM actions every time is inefficient, and results in us having no guarantee of seeing states that are far away from our initial states
        - ...which means the time random exploration needs grows exponentially with the size of the state space
    - So, let's instead try acting ON-POLICY, where we ALWAYS perform the best action that we currently know about
        - Initially, all of our Q-table values are 0, meaning our agent will take random actions
        - Eventually, though, the agent will see the same state twice, updating its Q-value - and if it did something that got it a positive reward, it'll do the same action it took last time!
            - This is GOOD because it gets us a positive reward, but it means that once we get just a little bit of reward, we'll stick to that path hyper-conservatively - but what if that path is only a local optimum? What if there's a HUGE reward just a little bit off the path we're on?
    - So, random is good because we get to see a lot of stuff, and on-policy lets us see the same states multiple times, so we can update their values repeatedly and refine our plan
    - So, to combine these two, we'll use a hybrid plan called the EPSILON-GREEDY scheme:
        
            epsilon = some number we choose between 0 and 1
            if random() < epsilon:
                take a random action
            else:
                take the best (on-policy) action from the Q-table

        - This scheme means we'll still zero-in on the best actions, but we'll keep exploring around a bit, too! (there's a fuller Python sketch of this right below)
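
- Here's a similarly minimal Python sketch of epsilon-greedy action selection against a Q-table (choose_action and the plain-dict Q argument are my own illustrative names; it would plug right into the q_update sketch from earlier):

        import random

        def choose_action(Q, state, actions, epsilon=0.1):
            """Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table."""
            if random.random() < epsilon:
                # Explore: take a uniformly random action
                return random.choice(actions)
            # Exploit: take the action with the highest current Q-value
            return max(actions, key=lambda a: Q.get((state, a), 0.0))

    - As epsilon shrinks toward 0, this behaves more and more like pure on-policy; at epsilon = 1 it's pure random exploration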

- Alright, that should be enough to start doing tabular reinforcement learning - in the meantime, prepare for the exam on Wednesday! Good luck!