//****************************************************************************//
//************** Markov Decision Processes - July 5th, 2018 *****************//
//**************************************************************************//

- The Gridworld grids are in bloom this afternoon on the board...
    - ...and what can't I show easily in ASCII? THAT'S RIGHT...
- Newly-blossomed, adjacent, and textual, however...
    - "MDP (Idea/def/example)
    - Personality
    - Additive/Discounted Utility
    - Utility Calculation
    - Value/Policy Iteration (if time)"
- ...and with that flowered, we reach 2:35

- A few administrative things:
    - Exams will be returned at the end of class
    - Don't forget that the decision tree homework is due on Sunday 
        - The Chi-Squared Pruning question was taken down on Harvard's website, so CMU's site is a nice alternative resource
------------------------------------------------------

- So last time, we wrapped up with expected utility, which really isn't that bad; for each action, we just multiply each possible outcome's utility by its probability, sum those up, and choose the action with the highest expected utility
    - At the same time, though, we realized that this simple way of calculating utility has problems with long-term prediction (e.g. the agent-lava problem); for those problems, we need MDPs
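    - (As a quick refresher, here's a minimal sketch of that one-step rule in Python; the actions, probabilities, and utilities below are made up for illustration:)

            # Single-step expected utility: for each action, sum P(outcome) * U(outcome)
            # over its possible outcomes, then pick the action with the highest total.
            outcomes = {
                "forward": [(0.8, 10.0), (0.1, -5.0), (0.1, -5.0)],   # (probability, utility)
                "wait":    [(1.0,  0.0)],
            }

            def expected_utility(action):
                return sum(p * u for p, u in outcomes[action])

            best = max(outcomes, key=expected_utility)
            print(best, expected_utility(best))   # forward 7.0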

- So, "Markov Decision Process" are our way of calculating expected utility for a whole sequence of events - how do they work?
    - Well, with a regular utility-based agent, we just choose the action with the highest expected utility for the next step
        - ...and, well, that's what our MDP agents will do! HOWEVER, the utility calculations for these sequences of events get quite a bit more complicated
    - Now, we also started giving the example of a GRIDWORLD problem, where we are 80% likely to move forward like we want and 10% likely to "slip" to each side instead
        - If this problem were deterministic, we could just solve it with search; as it is, though, if the shortest path were 5 steps long, we'd only have a (0.8^5 + 0.8*0.1^4) (about 0.33) chance of reaching the goal - not great
            - The "+" term comes from the chance that this specific plan on the board goes wrong but still stumbles into the goal (for this map, the only way that happens is if it slips right and follows the edge)
    - So, what we need instead is a POLICY FUNCTION: a function that can map EVERY single state to an action
        - This way, even if we end up in a state that our path didn't expect, we know what to do!
        - ...the difficulty, of course, is calculating this policy
            - "Now, we have to be a utility agent to calculate this function upfront, but afterwards we basically become a reflex agent - we just look up what we should do in that state!"
                - Because of this, it's VERY efficient at runtime - we go through a complex policy calculation to end up with a pretty simple agent
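        - (As a rough sketch, a policy really can just be a lookup table at runtime; the states and actions below are hypothetical placeholders:)

                # A policy maps EVERY state to an action, so at runtime the agent
                # just looks up the current state -- no search needed.
                policy = {
                    (1, 1): "Up",
                    (1, 2): "Up",
                    (1, 3): "Right",
                    # ...one entry for every state in the problem
                }

                def act(state):
                    return policy[state]   # reflex-style lookup at runtime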

- But what do MDPs have to do with this?
    - Well, we call this a "Markov Decision Process" because we ONLY care about what the current state is, similar to what we did for Bayes' Nets
    - Formally, we define an MDP as a problem with:
        - S (set of states)
        - S0 (initial state)
        - Sf (*optional*, terminal/final states)
            - You CAN have an MDP agent that just keeps running forever
        - A (set of actions)
        - T (transition matrix)
            - What's DIFFERENT from search is that our transition matrix is now in the format "T[s,a,s'] = P"
                - In other words, "P(s'|s,a)", or (in even differenter words) "Probability that if we start in state s, and take action a, we will arrive in state s'"; our transition matrix is now probabilistic
        - R (a reward function, maps a given state or state/action pair to some real number "reward")
            - "Computer Scientists got this idea from Psychologists, who liked the mathematical version so much that they stole it back"
    - This is pretty similar to search, but with some tweaks to things
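    - (A minimal sketch of that tuple as a Python container - the field names here are my own choice, not standard notation:)

            from dataclasses import dataclass
            from typing import Callable

            @dataclass
            class MDP:
                states: list        # S:  all states
                start: tuple        # S0: initial state
                terminals: set      # Sf: terminal states (may be empty - agent runs forever)
                actions: list       # A:  all actions
                T: Callable         # T(s, a, s') -> P(s' | s, a)
                R: Callable         # R(s) -> real-valued reward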
- Well, as it turns out, we can represent these MDPs as two linked "stochastic finite state machines" (i.e. the same as normal FSMs, except a given action could result in several possible states with different probabilities)
    - Specifically, it's the linked state machines for the Agent/Environment:

            Agent <------> Environment

        - "The agent doesn't HAVE to be stochastic, but it usually is"
        - Literally, the environment and the agent are usually two completely separate pieces of code; the agent will take an action, the environment will change to some state, the agent will see what the environment is like now, take a new action, etc.
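        - (Here's a rough sketch of that loop, assuming the MDP container from the sketch above and a policy dictionary; env_step/run_episode are made-up names:)

                import random

                def env_step(state, action, mdp):
                    # The ENVIRONMENT samples the next state from P(s' | s, a)
                    probs = [mdp.T(state, action, s2) for s2 in mdp.states]
                    return random.choices(mdp.states, weights=probs)[0]

                def run_episode(mdp, policy, max_steps=100):
                    s = mdp.start
                    total = 0.0
                    for _ in range(max_steps):
                        total += mdp.R(s)
                        if s in mdp.terminals:
                            break
                        a = policy[s]                # the AGENT acts
                        s = env_step(s, a, mdp)      # the environment responds
                    return total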
    - So, in GridWorld, we can say our MDP setup is like this:
        - S: all grid coordinates our agent could be in
        - S0: (1,1), the coordinate we start in
        - A: {Up, Down, Left, Right}
            - "Notice that our agent can't just stay still; it has to move at every step"
        - T: a 3D array with [current state, action, next state] coordinates and probabilities
            - So, assuming action = "Up", if s' runs along the horizontal axis and s along the vertical, a "slice" of the transition matrix might look like this:

                        (1,1)   (1,2)   (1,3)   (2,1)   (2,2)   (2,3)   (3,1)
                (1,1)   0.1     0.8             0.1
                (1,2)           0.2     0.8     
                (1,3)                   0.9                     0.1
                (2,1)   0.1                     0.8                     0.1

            - This is the FORMAL definition of the matrix, but we could, of course, replace it with a function in code (and that's usually exactly what we do)
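            - (For instance, here's a sketch of T as a function, assuming the usual 4-wide by 3-tall grid with (2,2) blocked; all names here are my own:)

                    # T(s, a, s') = P(s' | s, a): 0.8 in the intended direction,
                    # 0.1 to each perpendicular side; bumping into a wall (or the
                    # blocked square) means staying put.
                    BLOCKED = {(2, 2)}
                    MOVE = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
                    SIDES = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
                             "Left": ("Up", "Down"), "Right": ("Up", "Down")}

                    def slide(s, direction):
                        nxt = (s[0] + MOVE[direction][0], s[1] + MOVE[direction][1])
                        if nxt in BLOCKED or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
                            return s     # hit a wall or the blocked square: stay put
                        return nxt

                    def T(s, a, s_prime):
                        p = 0.8 if slide(s, a) == s_prime else 0.0
                        for side in SIDES[a]:
                            if slide(s, side) == s_prime:
                                p += 0.1
                        return p

                    # e.g. T((1,1), "Up", (1,2)) == 0.8 and T((1,1), "Up", (1,1)) == 0.1,
                    # matching the first row of the slice above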
        - R: Our reward function
            - So, we KNOW (4,3) is our goal, which gives a +1 reward, and (4,2) is something we want to avoid with a -1 reward
                - If we gave every other state a reward of 0, then our agent wouldn't know what to do until it could see the goal 5 steps ahead, since until then every path would have a total reward of 0 and, thus, look just as good
                - More problematically, this reward function would just tell the agent to find SOME path to the goal, without giving any preference to shorter paths
            - So, to fix this, we could say there's a small negative reward (like -0.04) for any move that doesn't reach the goal - this tells the agent that every extra step carries a small penalty (small compared to falling into the pit), so it'll naturally tend towards shorter paths
                - Now, this is a reward function that seems to work - but is it the BEST reward function? How can we know? We'll discuss that over the next few lectures
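            - (A sketch of that reward function; the step_cost parameter is exactly the knob we'll tweak in the "personality" discussion below:)

                    GOAL, PIT = (4, 3), (4, 2)

                    def R(s, step_cost=-0.04):
                        # +1 for reaching the goal, -1 for the pit, and a small
                        # "living penalty" for every other state
                        if s == GOAL:
                            return 1.0
                        if s == PIT:
                            return -1.0
                        return step_cost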
    
- Now, our agent's ENTIRE behavior is based around this reward function - pretty cool!
    - We can change the agent's "personality" just by changing the reward function
    - For instance, our current GridWorld reward function is fairly conservative and heavily penalizes the "risky" path past the pit - if we change the penalty for non-goal moves to a larger number like -0.4, though, the agent will conclude that shorter paths are worth the higher risk, so it'll risk moving past the pit if it's already near it (instead of trying to move back to the safe path, which our current agent would do)
        - If we change it to a REALLY big penalty like -1.5, then it becomes a bigger penalty to just keep moving than to fall into the pit - so what'll the agent do? It'll fall into the pit so that it stops moving immediately and stops losing points - it becomes suicidal!
        - If, on the other hand, we change the penalty to a VERY small number (like -0.002), the agent will prioritize living for as long as possible no matter what, so it'll run into walls whenever it's near the pit in the hope it'll "accidentally" move sideways away from the pit - that way, it has a 0% chance of falling in!
        - And if we change the penalty to a POSITIVE number, it'll run away from the pit AND the goal because it'll keep gaining points as long as it stays alive - more points than if it just won right away!
- ...and ALL of those interesting changes came just from tweaking one constant in our reward function!

- Alright, we've got this idea of a reward function and a transition matrix, but what is it actually doing with those rewards? How do we actually turn this reward into a policy?
    - In order to do this, as we said, we need a way that'll look at a SEQUENCE of actions so that our agent can plan for the future
        - We've said earlier that rational agents act "expected optimally;" we can restate that as saying it's an agent with the "maximum expected utility over all time"
        - So, our utility function CAN'T just be the reward for getting into a state, because then it's not considering time; it needs to somehow be the current reward PLUS the future rewards
            - Now, to compute the expected utilities we need a policy, and to choose a policy we need the expected utilities, sooooo...yeah, this'll be one of those iterative algorithms that starts with random numbers and refines itself
    - Now, the most obvious way of doing this would be this:

            U(s0, ..., sn) = R(s0) + R(s1) + ... + R(sn)

        - This requires knowing ALL of the future states, though, potentially infinitely far into the future!
    - To deal with this, there are 2 main approaches:
        1) ADDITIVE UTILITY means that we set a time limit and only look a fixed number of steps into the future; given that assumption, this current formulation of utility is actually fine!
            - This is the original way
            - There's a BIG problem with this, though: our policy will be different at different time steps! If there are only 2 timesteps remaining and the agent can't reach the goal, its priorities shift to being more conservative; on the other hand, if it CAN still reach the goal, it'll take bigger risks as the timer runs out
                - If you think back to Bayes' Nets, this basically violates stationarity; we have to calculate a separate policy for every possible number of remaining steps!
            - Setting the time limit to infinity solves this problem, but it becomes a problem to calculate these utilities; moreover, all of the sequences will then be unbounded and go to infinity/negative infinity!
        2) DISCOUNTED utility means that we deal with this infinite time limit problem by multiplying each future reward by some ratio "gamma" < 1; like a convergent geometric series, this means the utility converges to a finite value

                U(s0, ..., sn) = R(s0) + g*R(s1) + g^2*R(s2) + ... + g^n*R(sn)

            - This means our agent will care less about rewards that are far into the future unless they're worth A LOT - great! It naturally balances the value of the rewards w/ the effort required to get them
                - If we choose gamma = 1, of course, we'd end up with Additive utility again
                - If we choose gamma = 0, it becomes a single step agent again
                    - "So, gamma is a convenient hyperparameter for changing our agent's behavior"
            - "Keep in mind this is our agent THINKING into the future; it's how much we THINK the path will be worth, not how much it actually will be"
        - Okay, this "discounted utility" has several very nice properties:
            - The utility distribution is stationary (i.e. our policy doesn't change between steps)
            - Utility is finite
            - Penalty for time is built in via gamma, so we don't HAVE to assign arbitrary time penalties (like for GridWorld, we could just assign rewards for the goal and pit, and leave the rest at 0)
                - Gamma is often conveniently given by the problem (e.g. interest rates for investments, decay rates for radioactive molecules, etc.)
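        - (A quick sketch of the discounted sum with a made-up reward sequence, just to show how gamma interpolates between the single-step and additive cases:)

                def discounted_utility(rewards, gamma):
                    # U = R(s0) + g*R(s1) + g^2*R(s2) + ...
                    return sum((gamma ** t) * r for t, r in enumerate(rewards))

                rewards = [-0.04, -0.04, -0.04, -0.04, 1.0]   # four steps, then the goal
                print(discounted_utility(rewards, 1.0))   # additive:    0.84
                print(discounted_utility(rewards, 0.9))   # discounted: ~0.519
                print(discounted_utility(rewards, 0.0))   # single-step: -0.04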

- One thing to consider as we get into Reinforcement Learning is the "Credit Assignment Problem:" if we run for 20 timesteps and don't get any reward, and then SUDDENLY get a reward, WHY did we get that reward? 
    - If we're playing chess and only get a reward for checkmating someone, we might think the clever thing was moving our pawn, when the ACTUALLY clever thing we did that won us the game was ten moves ago
        - What if we did dumb things in between that could've been improved, and just went right at the very end? How do we know what action we should prioritize?
    - Very often, we have delayed rewards that actually resulted from actions taken much earlier
- Figuring out what's actually important for us to do, then, is a difficult problem - one we'll talk more about next week.