//****************************************************************************//
//******* Deep Reinforcement Learning (cont.) - April 15th, 2019 ************//
//**************************************************************************//

- I got's dem pre-finals drowning-in-work-and-mild-self-pity-and-microwave-dinners sorta blues
- BUT, here's a shaft of light in my cavern of partly self-imposed procrastination-fueled panic: RIEDL'S BACK!
    - (You know it's gonna be a good lecture when it starts with PC Gamer on the slides)
- "Did everyone see the news this weekend? This Saturday, for the first time ever, an AI-controlled DOTA 2 team beat the best human team in the world!"
    - Obviously their RL algorithm is a little bit more sophisticated, but it's the same ideas as what we're learning about
    - They trained it for the equivalent of 45,000 years worth of DOTA 2 games
        - "So, if you're frustrated that it takes an hour to train your homework 8 bot..."
    - Seriously though, deep RL is horribly inefficient; they had to rent 218,000 CPU cores for 10 months to train their agent
- BEFORE we move on, though, you should've gotten your Exam 2 grades back; let's go over the answers to that:
    1) HTNs vs A*:
        - HTNs exploit designer intuition
        - HTNs are more efficient
        - A* is optimal
        - A* can find new solutions
        - HTNs can reliably produce behavior patterns
        - HTNs can delay planning parts of the solution
    2) What are reasons for using PCG?
        - YES, it saves dev costs
        - NO, it doesn't always produce higher-quality content
        - YES, it does increase replayability if done properly
        - YES, it does allow us to customize the experience
    3) Seeds can be used for:
        - Allowing different people to get the same content
        - Random seeds guarantee unique levels for everyone
    4) PCG helps w/ permadeath so that players don't have to play the exact same content over and over again
    5) Genetic algorithms are NOT guaranteed to create precise numbers of things
    6) ...yeah, this question was worded terribly; we decided to just give everyone full credit for it
    7) Matching different graphs to "flow" trajectories of player talent vs game challenge:
        - User improving their skill faster than the game was D
        - Improving slower than the game was A
        - Improving at about the same rate was C
        - Too easy at first, too hard at the end was B
    8) When should you use design-time vs run-time PCG?
        - If you have little computation time but LOTS of storage, design-time!
        - If you have a LOT of computation time and little storage, run-time!
        - If you have little computation time and very little storage, you just can't store anything, so you HAVE to use run-time
        - If you have a lot of storage but an NP-hard problem, you should do design-time, since that way you can at least verify your outputs are decent
    9) Micro-structure fitness functions help us give pacing to a level, and can match a designer-specified difficulty curve (as opposed to MACRO-structure functions, which can't give structure to the level)
    10) This ended up being an easier question than expected, but you basically just had to match the HTN actions to the output, which was "TransformToVehicle -> Drive -> TransformToRobot -> Shoot"

----------------------------------------------------------------------

- Last week, Raj took over the job of explaining deep reinforcement learning to us - he's VERY knowledgeable about that, but he's also fairly new to lecturing, so let's recap what he should've gone over
- We've decided that we're working with screens of pixels as our states, which means we need to map from screens of pixels to our Q-values
    - This is identical to our normal Q-learning problem, but there are WAY too many raw pixel states, so we need some way of figuring out which variations of the screens don't matter
    - To do that, we'll use CNNs to train our model, and over time and many backpropagation iterations it should "learn" which combinations don't affect its results
    - ...the problem with this, though, is that we have NO idea which Q-value we need to update for a given screen, unlike in our tabular learning task
- So, how do we train this neural network?
    - We've got a screen with width x height pixels, each of which has an RGB component
        - Some people just convert this to grayscale to speed up training if color doesn't matter, but for more complex games color might be important
    - What we WANT are the Q-values for any given state/action pair given this screen state
    - Initially, these values are going to be basically white noise, since we don't know what the right Q-values to update are
    - HOWEVER, we still have the Q-update function: Q(s_t, a_t) = reward + gamma * max_a Q(s_t+1, a)
        - "This is simplified slightly, since the training doesn't need to be as precise"
    - If we think about it, we DID have this same problem in tabular learning, where we didn't know what to do at first but eventually figured it out as Q improved
        - The difference now is that we're applying that update to the neural net itself!
    - Here, instead of giving the neural net the "right" answers up front, we give it target Q-values computed from its own (initially random) estimates and train it to "chase" those numbers
        - (A little confused here, but okay)
    - So, we're initially training on the WRONG numbers, which means our neural net weights are also wrong and need to be trained first
        - This creates INSTABILITY: since we're training on the wrong numbers, we can overshoot the right answers and accidentally unlearn everything we had
        - We'll need to make this training more stable, then, but we'll come back to that
- To train any neural net, we need to know the LOSS of our learning
    - In this case, the loss is just the difference between Q(s_t, a_t) and reward + gamma * max_a Q(s_t+1, a) (a small code sketch of this is just below)
    - Once we get that loss, we'll backpropagate the error through our neural net to improve
- Once we've got all this, we have everything we need to begin training in earnest, right? For each step, we'll take an action, get the reward, compute the loss and backpropagate, and repeat
    - This'll work, but it's REALLY unstable - so we need to do some weird stuff to get this to work better!
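- Before we get to those fixes, here's a minimal sketch of that per-step target/loss computation, just to make it concrete - this is my own illustration, not code from lecture, and q_estimates/GAMMA are placeholder names standing in for the CNN forward pass and the discount factor:

    import numpy as np

    GAMMA = 0.99  # the discount factor ("gamma") from the Q-update above

    def q_estimates(state, weights):
        """Stand-in for the CNN forward pass: returns one Q-value per action for this screen."""
        return weights @ state  # shape: (num_actions,)

    def td_loss(state, action, reward, next_state, done, weights):
        """Squared TD error for a single transition (s_t, a_t, reward, s_t+1)."""
        prediction = q_estimates(state, weights)[action]  # Q(s_t, a_t)
        if done:
            target = reward  # terminal state: the target is just the reward
        else:
            target = reward + GAMMA * np.max(q_estimates(next_state, weights))  # reward + gamma * max_a Q(s_t+1, a)
        return (target - prediction) ** 2  # this is the loss we backpropagate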
- To fix this, one of the things we'll add is REPLAY MEMORY!
    - One of the BIG reasons our training is so unstable is that after EVERY action, we try to update all of our neural network weights
        - This means we can take an action, get an awesome reward, and update our weights to do that - but if we take another action, get no reward, and change all our weights, we might've just changed them from "really good" to "really bad" - we can undo all of the hard work we've done!
        - This is known as CATASTROPHIC FORGETTING (or, incidentally, what happens the night before every exam)
    - Somehow, we want to remember all of the good stuff we've done in the past and reinforce those weights, even if we might never revisit the same state again
    - To do this, we'll create a buffer that stores the last "N" transitions from the past, known as the REPLAY MEMORY
        - A transition is a state/action pair and the resulting outcome state/reward
    - Now, after every action we take, we'll store the transition in our memory and then retrain our neural net on ALL transitions in memory
        - So at each step, we're "reminding" ourselves of everything we've done in the past
    - HOWEVER, we can only train on so many memories before doing this every step becomes ridiculously slow - so how can we address that?
    - What we need to do, then, is choose a small random subset of memories from our replay memory (say 100 or so), and ONLY train on that (see the buffer sketch just below)
        - Now, this is a little screwy - it means that instead of training on the action we just took, we ONLY add that recent action as a memory, and then train on 100 randomly selected memories, hoping that we eventually sample that new memory
        - "It's like if you kept walking into a wall over and over again, and instead of learning to not walk into a wall the agent is thinking 'How can I improve my breakfast routine? How can I improve my note-taking strategies?'"
        - This is REALLY screwy, since it's not at all how humans learn, but for our computer overlord toddlers-in-training, it works!
    - It is possible to choose a non-random sample to speed up our training (e.g. based on how recently those memories have been used for training)
        - This is known as PRIORITIZED REPLAY MEMORY
- Another thing we can do to help stabilize our learning is BATCHING!
    - Oftentimes, we're training these things on a GPU, and that means we can run a LOT of stuff in parallel
    - That means we can take all of our sampled transitions from replay memory and process them ALL on the GPU at once
        - Remember, CPUs are SISD ("single-instruction single-data"), while GPUs are generally SIMD ("single-instruction multiple-data," meaning we do the same operation to many different things)
        - If you think about a neural net, we need to apply some loss-driven correction to every one of our weights - which makes them perfect candidates for the GPU!
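- Here's a minimal sketch of what that replay memory could look like - again my own illustration, not homework code; the capacity and batch size are arbitrary placeholder values:

    import random
    from collections import deque

    class ReplayMemory:
        """Stores the last `capacity` transitions and hands back random minibatches."""

        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)  # the oldest transitions fall off the back

        def store(self, state, action, reward, next_state, done):
            # A transition: the state/action pair plus the resulting reward and next state
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=100):
            # Train on a small random subset of memories instead of the whole buffer
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))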
- If we have extra GPU memory, then instead of just processing a single screen, we can process multiple of them at once
    - The only slightly screwy thing is that we now need to use BATCH LOSS: if we process, say, 5 screens all at once, then we use the AVERAGE of their 5 loss values for our backpropagation (a small code sketch of this is at the very end of these notes)
    - In the long run this smooths out our learning; we need more iterations to converge, but we make fewer backpropagation passes overall, ultimately speeding up our training
- By and large, no one has figured out how to get perfectly stable training, and it's still an area of VERY active research
    - The two general approaches to combat this right now are either:
        - a) We remember our best model so far (what we do in the homework), or
        - b) We just train for a VERY long time
- SO, let's put this all together
- To ACTUALLY train this deep learning monstrosity, we'll do the following:

    initialize replay memory
    initialize DQN to random weights theta
    for each training episode:
        initialize the game
        for time step t in (1, T):
            with probability epsilon:
                a_t = pick a random action
            else:
                a_t = argmax_a Q(s_t, a | theta)  # i.e., ask our neural net which action has the highest predicted Q-value for this state
            execute a_t and get the reward/new state
            store the transition (s_t, a_t, s_t+1, reward) in replay memory
            sample a batch of transitions (s_j, a_j, s_j+1, reward_j) from replay memory
            compute target Q-values (just reward_j if s_j+1 is terminal, else reward_j + gamma * max_a Q(s_j+1, a | theta))
            compute the loss against those targets
            backpropagate the loss through the DQN

- Every few episodes, we'll test our agent to make sure we're actually learning stuff
- Alright, and that's mostly it for reinforcement learning! We'll finish this up on Wednesday, then revisit PCG - stop by then!
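- And here's that batch-loss sketch promised above - my own illustration, not the homework's API; q_estimates stands in for the CNN forward pass, GAMMA for the discount factor, and the transition arrays are assumed to be stacked along the batch axis:

    import numpy as np

    GAMMA = 0.99  # discount factor

    def q_estimates(states, weights):
        """Stand-in for the CNN forward pass: one row of Q-values per screen in the batch."""
        return states @ weights.T  # shape: (batch_size, num_actions)

    def batch_loss(states, actions, rewards, next_states, dones, weights):
        """Average squared TD error over a sampled batch of transitions."""
        predictions = q_estimates(states, weights)[np.arange(len(actions)), actions]  # Q(s_j, a_j) for each j
        targets = rewards + GAMMA * np.max(q_estimates(next_states, weights), axis=1) * (1 - dones)  # just reward_j at terminal states
        # Averaging over the batch is what smooths out the individual noisy updates
        return np.mean((targets - predictions) ** 2)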