//****************************************************************************//
//************** Deep Learning (cont.) - April 10th, 2019 *******************//
//**************************************************************************//

- AGH! Professor Riedl isn't here!
    - Instead, we'll be learning today from Prithviraj, a PhD student working with Mark
    - ***WARNING***: I was extremely confused throughout this section (he went through the content a bit too rapidly for me to process)

-------------------------------------------------

- We've talked about reinforcement learning, which is VERY useful when we're not able to label all the data (*cough* video games *cough*), making supervised learning difficult
- As we've also talked about, reinforcement learning is fundamentally a Markov Decision Process
    - To speed this up, we'll usually make the slightly-dodgy MARKOV ASSUMPTION, where we assume that the current state ALONE tells us everything we need to know about the future of the game
    - This usually isn't completely true (previous states can affect the game, etc.), but for many purposes it's good enough
- For our "Coin Run"-playing assignment, we want to figure out pieces of the game state (the agent's position, the number of coins, the locations of enemies, etc.) so that we can play the game - but we don't have access to the game's API!
    - Because of that, we need to figure out the state from the raw pixel data on the screen; that's all we have to go off of
- Once we've figured out the current game state (as best we can), we want to do some sort of Q-learning to figure out the best action for that state
    - If we treat the game's state as "all possible pixel combinations", though, that's WAY too many states - our Q-table would take up too much memory to ever be practically solvable!
    - To get around this, we'll use a deep neural net to approximate the Q-values instead of storing them in a table
    - There's a problem with this approach, though, that we'll get to later: we end up "chasing a moving target", since the targets we train the network toward are Q-value estimates produced by the network itself
- First up, how do we actually parse the pixel data?
    - As we briefly talked about yesterday, we can use convolutional neural nets to handle this data
    - For our assignment, this is made a LOT easier because the agent's current velocity is shown on the screen, meaning the Markov assumption holds - the current frame has everything we need!
    - If this weren't the case, we'd have to estimate the agent's velocity by comparing the current frame with the previous frame - and that can be HARD
- With that gist out of the way, let's go through a brief crash-course on how to use PyTorch in "Pytorch_Primer.ipynb" (might need to look at this: https://colab.research.google.com/drive/1DgkVmi6GksWOByhYVQpyUB4Rk3PUq0Cp)
    - I know, undergrads don't HAVE to do this, but PyTorch is still a really nice tool to learn about
    - Let's suppose we're trying to train a basic self-driving car to decide 2 things: when to turn, and when to accelerate
    - For our neural net, we need to define the shape of the network (how many layers it has, etc.), plus a forward() function that pushes inputs through those layers
    - We can define a layer using "Linear(numNeurons, numNeuronsInNextLayer)", along with the activation function for the neurons in that layer
    - That gets us a basic neural network, which is enough for us to calculate our loss function; we then hand the loss to an optimizer, which calculates what we need to do to improve our estimates (see the sketch right below this list)
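- To make that concrete, here's a minimal sketch of the kind of code the primer walks through, using the self-driving car example (2 outputs: turn and accelerate). The class name CarNet, the layer sizes, and the fake data are my own placeholders, NOT the notebook's actual contents:

    import torch
    import torch.nn as nn
    import torch.optim as optim

    class CarNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Linear(numNeurons, numNeuronsInNextLayer) gives us a fully-connected layer
            self.hidden = nn.Linear(4, 8)   # 4 made-up sensor inputs -> 8 hidden neurons
            self.out = nn.Linear(8, 2)      # 2 outputs: turn, accelerate

        def forward(self, x):
            # The forward() function pushes inputs through the layers + activations
            x = torch.relu(self.hidden(x))
            return self.out(x)

    net = CarNet()
    loss_fn = nn.MSELoss()                             # loss function
    optimizer = optim.SGD(net.parameters(), lr=0.01)   # the optimizer

    # One training step on made-up data
    x = torch.randn(10, 4)         # 10 fake sensor readings
    target = torch.randn(10, 2)    # 10 fake (turn, accelerate) labels
    loss = loss_fn(net(x), target)

    optimizer.zero_grad()   # clear any old gradients
    loss.backward()         # backpropagate: compute gradients of the loss w.r.t. every weight
    optimizer.step()        # let the optimizer adjust the weights using those gradients

    # Sanity check: every parameter should now have a gradient attached
    for name, param in net.named_parameters():
        print(name, param.grad.norm())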
- When we backpropagate, our optimizer will use the loss function to adjust all the weights for the neurons in each layer
    - If we want, we can check the new parameters/gradients that were computed as a sanity check (just to make sure the weights are actually being updated, and there isn't a silly bug hiding somewhere)
- So, that's PyTorch in general, but what we're trying to do for THIS project is use 2D Convolutional Neural Networks (CNN - already, a promising start to our data collection efforts) to process our images
    - See here for the assignment/background (had trouble following the TA): https://github.com/markriedl/coinrun-game-ai-assignment
- From there, we need to actually train our Q-network using EXPERIENCE REPLAY
    - See, just learning from consecutive samples of actions is pretty slow; consecutive frames are strongly correlated, so we can end up with bad feedback loops, slowing us down
    - To address this, EXPERIENCE REPLAY has us continually add the transitions we observe while playing the game to a "replay memory", and then train on random samples drawn from that memory instead of on just the most recent transition
    - This is implemented using a RING BUFFER, where we store experiences in a circular buffer; when the buffer is full, it starts overwriting the earliest experiences we've stored (this way, we don't just take up all of our memory) - a minimal sketch is at the end of these notes
- So, with all of these individual steps described, let's see what the algorithm looks like when we put them all together (rough code sketches are also at the end of these notes):

    Initialize replay memory to capacity N
    Initialize action-value function Q with random weights
    Initialize target action-value function Q* with the same weights
        # we refresh this with weights from the other network regularly
    for episode = 1 to M:
        Initialize sequence s_1 = {x_1} and preprocessed sequence theta_1 = theta(s_1)
        for t = 1 to T:
            With some probability, select a random action a_t
                (otherwise, choose the best action for theta(s_t))
            Execute a_t, and observe reward r_t and the next image x_{t+1}
            Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess theta_{t+1} = theta(s_{t+1})
            Store the transition (theta_t, a_t, r_t, theta_{t+1}) in replay memory
            Sample a random minibatch of transitions (theta_j, a_j, r_j, theta_{j+1}) from memory
            If theta_{j+1} is a terminating state:
                y_j = r_j
            else:
                y_j = r_j + gamma * max_a' Q*(theta_{j+1}, a')
            Perform a gradient descent step on (y_j - Q(theta_j, a_j))^2 with respect to the network parameters
            Every C steps, reset Q* = Q

- "...okay, that was probably a lot to digest, and you look confused, but trust me that it'll make more sense when you start writing the code"
- Professor Riedl will go into this in more depth on Monday, so you're being set free 15 minutes early - hurrah! GO FORTH!
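- As promised above, a minimal sketch of the experience-replay RING BUFFER (the class name, the Transition fields, and the method names are my guesses at a reasonable API; the assignment's starter code may structure this differently):

    import random
    from collections import namedtuple

    # One experience: what we saw, what we did, what we got, and what we saw next
    Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state'))

    class ReplayMemory:
        """Ring buffer: once full, new transitions overwrite the oldest ones."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.memory = []
            self.position = 0  # index of the next slot to write into

        def push(self, *args):
            if len(self.memory) < self.capacity:
                self.memory.append(None)   # still growing toward capacity
            self.memory[self.position] = Transition(*args)
            self.position = (self.position + 1) % self.capacity   # wrap around when full

        def sample(self, batch_size):
            # Random minibatch of past transitions - this is what breaks up the correlations
            return random.sample(self.memory, batch_size)

        def __len__(self):
            return len(self.memory)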
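- And a rough sketch of how that pseudocode might map to PyTorch, assuming a replay memory with sample()/len() like the ReplayMemory sketched in these notes, 64x64 RGB frames, and made-up hyperparameters and architecture - none of this is the assignment's actual code:

    import random
    import torch
    import torch.nn as nn

    GAMMA = 0.99         # discount factor (made-up value)
    EPSILON = 0.1        # exploration probability (made-up; usually annealed over time)
    BATCH_SIZE = 32
    TARGET_SYNC = 1000   # "C": how often to copy the Q-network's weights into Q*

    class QNetwork(nn.Module):
        """Small 2D CNN over raw 3x64x64 pixel frames (architecture is made up)."""
        def __init__(self, num_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * 6 * 6, 256), nn.ReLU(),
                nn.Linear(256, num_actions),   # one Q-value per action
            )

        def forward(self, x):
            return self.net(x)

    def select_action(policy_net, state, num_actions):
        # Epsilon-greedy: random action with some probability, otherwise the best action for this state
        if random.random() < EPSILON:
            return random.randrange(num_actions)
        with torch.no_grad():
            return policy_net(state).argmax(dim=1).item()

    def learn(policy_net, target_net, optimizer, memory):
        # One gradient descent step on a random minibatch sampled from replay memory
        if len(memory) < BATCH_SIZE:
            return
        batch = memory.sample(BATCH_SIZE)
        states = torch.cat([t.state for t in batch])                   # each state is a (1, 3, 64, 64) tensor
        actions = torch.tensor([t.action for t in batch]).unsqueeze(1)
        # y_j = r_j for terminal transitions, else r_j + gamma * max_a' Q*(s_{j+1}, a')
        targets = torch.tensor([t.reward for t in batch], dtype=torch.float32)
        for j, t in enumerate(batch):
            if t.next_state is not None:   # next_state=None marks a terminal transition
                with torch.no_grad():
                    targets[j] += GAMMA * target_net(t.next_state).max().item()
        q_values = policy_net(states).gather(1, actions).squeeze(1)    # Q(s_j, a_j) for the actions we took
        loss = nn.functional.mse_loss(q_values, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Every TARGET_SYNC steps in the outer loop, refresh the target network:
    #     target_net.load_state_dict(policy_net.state_dict())
    # That's the "reset Q* = Q" line - it holds the "moving target" still long enough for training to stay stable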