//****************************************************************************//
//************** Deep Learning (cont.) - April 10th, 2019 *******************//
//**************************************************************************//

- AGH! Professor Riedl isn't here!
    - Instead, we'll be learning today from Prithviraj, a Ph.D. student working with Mark
- ***WARNING***: I was extremely confused throughout this section (he went through the content a bit too rapidly for me to process)
-------------------------------------------------

- We've talked about reinforcement learning, which is VERY useful when we're not able to label all the data (*cough* video games *cough*), making supervised learning difficult
    - As we've also talked about, reinforcement learning is fundamentally framed as a Markov Decision Process (MDP)
        - To keep this tractable, we'll usually make the slightly-dodgy MARKOV ASSUMPTION: the current state ALONE tells us everything we need to predict the next state of the game (written out below)
            - This usually isn't completely true (earlier states can still matter in practice), but for many purposes it's good enough
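    - In symbols (my notation, not from the lecture), the Markov assumption says the next state depends only on the current state and the action we take:

            P(s_t+1 | s_t, a_t) = P(s_t+1 | s_1, a_1, ..., s_t, a_t)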

- For our "Coin Run"-playing assignment, we want to figure out pieces of the game state (agent's position, number of coins, location of enemies, etc.) so that we can play the game - but we don't have access to the game's API!
    - Because of that, we're going to need to figure out that state from the raw pixel data on the screen; that's all we have to go off of
    - Once we've figured out the current game's state (as best we can), we want to do some sort of Q-learning to figure out the best action for that state (the standard update rule is written out at the end of this bullet)
        - If we just treat the game's state as "every possible combination of pixels", though, that's WAY too many states - a Q-table over them would take far too much memory to ever be practically usable!
            - To get around this, we'll use a deep neural network that takes the pixels as input and estimates the Q-values directly (a "deep Q-network"), instead of storing a giant table
    - There's a problem with learning this way, though, that we'll get to, where we're "chasing a moving target"
        - The network's training targets are computed from its own Q-value estimates, so every weight update also shifts the targets we're trying to hit - that's why the algorithm later in these notes keeps a separate target network that only gets refreshed occasionally
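    - For reference, the standard tabular Q-learning update (general background, not something written on today's slides) looks like:

            Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (r_t + gamma * max_a' Q(s_t+1, a') - Q(s_t, a_t))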

- First up, how do we actually parse the pixel data?
    - As we briefly talked about yesterday, we can use convolutional neural nets to handle this data
        - For our assignment, this is made a LOT easier because we have the current velocity on the screen, meaning the Markov assumption holds - the current frame has everything we need!
            - If this wasn't the case, we'd have to estimate the agent's velocity by comparing the current frame with previous frames - and that can be HARD (the usual workaround is to stack a few recent frames as the input; see the sketch below)
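    - A rough sketch of that common frame-stacking workaround (so the network can "see" motion); the frame size and count here are my own assumptions, not the assignment's:

            import torch

            # Hypothetical example: keep the last 4 grayscale 84x84 frames and stack
            # them along the channel dimension so a conv net can infer velocity.
            frames = [torch.zeros(84, 84) for _ in range(4)]   # placeholder frames

            def push_frame(frames, new_frame):
                """Drop the oldest frame and append the newest one."""
                return frames[1:] + [new_frame]

            state = torch.stack(frames, dim=0)   # shape: (4, 84, 84)
            state = state.unsqueeze(0)           # add a batch dimension -> (1, 4, 84, 84)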

- With that gist out of the way, let's go through a brief crash-course on how to use PyTorch in "Pytorch_Primer.ipynb" (might need to look at this: https://colab.research.google.com/drive/1DgkVmi6GksWOByhYVQpyUB4Rk3PUq0Cp)
    - I know, undergrads don't HAVE to do this, but PyTorch is still a really nice tool to learn about
    - Let's suppose we're trying to train a basic self-driving car to predict 2 things: how much to turn, and how much to accelerate
        - For our neural net, we need to define the shape of the network (how many layers it has, how big they are, etc.) and a forward() function that says how data flows through those layers
            - We can define a fully-connected layer with nn.Linear(numInputs, numOutputs), and then apply an activation function to that layer's output
        - That gets us a basic neural network, which is enough to compute a loss - we then use an OPTIMIZER (e.g., SGD) to figure out how to adjust the weights based on that loss
            - When we backpropagate (loss.backward() followed by optimizer.step()), the gradients of the loss are used to update the weights of the neurons throughout the network
            - If we want, we can inspect the new parameters/gradients that were computed as a sanity check (just to make sure the weights are actually being updated, and there isn't a silly bug hiding somewhere) - see the sketch below
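    - A minimal sketch of what that primer walks through (the layer sizes, data, and names here are my own invention, not the actual notebook):

            import torch
            import torch.nn as nn

            # Toy "self-driving" net: 4 input features -> 2 outputs (turn, accelerate)
            class DriverNet(nn.Module):
                def __init__(self):
                    super().__init__()
                    self.fc1 = nn.Linear(4, 16)   # 4 inputs -> 16 hidden neurons
                    self.fc2 = nn.Linear(16, 2)   # 16 hidden -> 2 outputs

                def forward(self, x):
                    x = torch.relu(self.fc1(x))   # activation for the hidden layer
                    return self.fc2(x)

            net = DriverNet()
            optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
            loss_fn = nn.MSELoss()

            # One fake training step on made-up data
            x = torch.randn(8, 4)        # batch of 8 made-up "sensor readings"
            target = torch.randn(8, 2)   # batch of 8 "correct" (turn, accelerate) pairs

            optimizer.zero_grad()            # clear any old gradients
            loss = loss_fn(net(x), target)   # forward pass + loss
            loss.backward()                  # backpropagate
            optimizer.step()                 # update the weights

            # Sanity check: gradients should now be populated
            print(net.fc1.weight.grad is not None)   # True if backprop actually ran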

- So, that's PyTorch in general, but what we're trying to do for THIS project is use a 2D Convolutional Neural Network (CNN - already, a promising start to our data collection efforts) to make sense of our images (a rough sketch follows below)
    - See here for the assignment/background (had trouble following the TA): https://github.com/markriedl/coinrun-game-ai-assignment
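    - Roughly what that could look like in PyTorch (a hedged sketch - the input size, layer sizes, and action count here are assumptions, not the assignment's actual architecture):

            import torch
            import torch.nn as nn

            # Sketch of a small 2D-CNN Q-network: raw pixels in, one Q-value per action out
            class QNetwork(nn.Module):
                def __init__(self, num_actions):
                    super().__init__()
                    self.conv1 = nn.Conv2d(3, 16, kernel_size=8, stride=4)   # 3x64x64 -> 16x15x15
                    self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)  # 16x15x15 -> 32x6x6
                    self.fc1 = nn.Linear(32 * 6 * 6, 256)
                    self.fc2 = nn.Linear(256, num_actions)                   # one Q-value per action

                def forward(self, x):
                    x = torch.relu(self.conv1(x))
                    x = torch.relu(self.conv2(x))
                    x = x.view(x.size(0), -1)    # flatten for the fully-connected layers
                    x = torch.relu(self.fc1(x))
                    return self.fc2(x)

            q_net = QNetwork(num_actions=7)           # 7 is a placeholder action count
            fake_screen = torch.randn(1, 3, 64, 64)   # batch of 1 made-up 64x64 RGB frame
            print(q_net(fake_screen).shape)           # torch.Size([1, 7])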

- From that, we need to actually train our Q-network using EXPERIENCE REPLAY
    - See, learning only from consecutive samples of actions is inefficient - consecutive frames are highly correlated, and training on them in order can create bad feedback loops that slow us down
        - To address this, EXPERIENCE REPLAY keeps a "replay memory" of the transitions (state, action, reward, next state) we've seen while playing, and trains on random minibatches sampled from it
            - This is implemented as a RING BUFFER: we store experiences in a fixed-size circular buffer, and once the buffer is full, new experiences start overwriting the oldest ones (this way, we don't just take up all of our memory) - see the sketch below
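    - A minimal sketch of a replay memory as a ring buffer (the class and method names are my own, not necessarily the assignment's starter code):

            import random

            class ReplayMemory:
                """Fixed-capacity ring buffer of (state, action, reward, next_state, done) tuples."""

                def __init__(self, capacity):
                    self.capacity = capacity
                    self.buffer = []
                    self.position = 0   # index of the next slot to overwrite

                def push(self, transition):
                    if len(self.buffer) < self.capacity:
                        self.buffer.append(transition)           # still filling up
                    else:
                        self.buffer[self.position] = transition  # overwrite the oldest entry
                    self.position = (self.position + 1) % self.capacity

                def sample(self, batch_size):
                    return random.sample(self.buffer, batch_size)   # random minibatch

                def __len__(self):
                    return len(self.buffer)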

- So, with all of these individual steps described, let's see what the full algorithm (deep Q-learning w/ experience replay and a target network) looks like when we put them all together (a rough PyTorch sketch of the inner update step follows the pseudocode):

        Initialize replay memory to capacity N
        Initialize action-value function Q w/ random weights
        Initialize target action-value function Q* w/ the same weights  # we refresh this w/ the main Q-network's weights every C steps
        for M runs:
            Initialize sequence s_1 = {x_1} and preprocessed sequence theta_1 = theta(s_1)
            for T steps:
                select a random action w/ some probability epsilon (otherwise, choose the best action for theta(s_t))
                execute the action and observe the reward r_t and the next screen image x_t+1
                set s_t+1 = (s_t, a_t, x_t+1) and preprocess theta_t+1 = theta(s_t+1)
                store the transition (theta_t, a_t, r_t, theta_t+1) in replay memory
                sample a random minibatch of transitions (theta_j, a_j, r_j, theta_j+1) from memory
                if the sampled transition j ended the episode:
                    y_j = r_j
                else:
                    y_j = r_j + gamma * max_a' Q*(theta_j+1, a')
                perform a gradient descent step on (y_j - Q(theta_j, a_j))^2 w/ respect to the network parameters
                every C steps, reset Q* = Q
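    - For intuition, here's roughly what the inner "sample a minibatch and do a gradient step" part could look like in PyTorch - a sketch under assumptions, reusing the hypothetical QNetwork and ReplayMemory classes from above, not the assignment's actual code:

            import torch
            import torch.nn as nn

            GAMMA = 0.99       # discount factor (assumed value)
            BATCH_SIZE = 32

            def optimize_step(q_net, target_net, memory, optimizer):
                """One gradient-descent step of deep Q-learning w/ experience replay."""
                if len(memory) < BATCH_SIZE:
                    return   # not enough experience collected yet

                batch = memory.sample(BATCH_SIZE)
                states, actions, rewards, next_states, dones = zip(*batch)
                states = torch.stack(states)                  # (B, 3, 64, 64)
                next_states = torch.stack(next_states)
                actions = torch.tensor(actions).unsqueeze(1)  # (B, 1)
                rewards = torch.tensor(rewards, dtype=torch.float32)
                dones = torch.tensor(dones, dtype=torch.float32)   # 1.0 if the episode ended

                # Q(theta_j, a_j) for the actions we actually took
                q_values = q_net(states).gather(1, actions).squeeze(1)

                # y_j = r_j                                     if the episode ended
                #     = r_j + gamma * max_a' Q*(theta_j+1, a')  otherwise
                with torch.no_grad():
                    next_q = target_net(next_states).max(1)[0]
                targets = rewards + GAMMA * next_q * (1.0 - dones)

                loss = nn.functional.mse_loss(q_values, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            # "every C steps, reset Q* = Q" is just a weight copy:
            # target_net.load_state_dict(q_net.state_dict())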

- "...okay, that was probably a lot to digest, and you look confused, but trust me that it'll make more sense when you start writing the code"
    
- Professor Riedl will go into this in more depth on Monday, so you're being set free 15 minutes early - hurrah! GO FORTH!