//****************************************************************************//
//************* Markov Decision Problems - October 22nd, 2019 ***************//
//**************************************************************************//

- On stage, what do we see? Not Professor Byrd, not he
    - If I write these notes in rhyme, would you prefer to go blind?

- "Okay, let's go!"
    - Dr. Brian Hrolenok is here to give us the lowdown on MDPs for machine learning
--------------------------------------------------------------------------------

- Okay, we're going to be learning about Markov Decision Processes today - but let's start off with a bit of background (this is VERY similar to stuff from Intro. to AI, so apologies for the repeat)
    - REINFORCEMENT LEARNING is the 3rd big type of machine learning, along with supervised and unsupervised learning, and it's WEIRD compared to them
        - Reinforcement learning is when we have a sequence of decisions we need to make to get some result, and we want to find the "best" sequence for our needs
        - The natural way to think of this is when we have some agent working in an environment, like a robot doing...um...stuff
            - As an example, let's say we're a Roomba in a 2-cell world, where the whole world is (you guessed it) 2 cells!
                - The Roomba can move left, right, or clean the cell it's in
                    - However, it only has a 90% chance of moving in the direction it wants to go!
                - Why is this different from planning? Because the environment is NOT deterministic - stuff can change dynamically, so we can't just make a master plan in advance and follow it
    - To do reinforcement learning, our agent is getting 2 pieces of input from the environment: the STATE the environment is in after the agent does something, and a REWARD it gets for doing that action and getting the resulting state
        - We want, of course, to maximize the rewards we're getting for all the actions that we take!
        - For our Roomba, there's a chance it'll move in the wrong direction; let's formalize this into a TRANSITION MODEL
            - To do this, we'll say there's a 90% chance of moving in the intended direction, a 5% chance of veering 90 degrees to one side of it, and a 5% chance of veering 90 degrees to the other side (and if a move would go into a wall, the Roomba just stays where it is)
            - We'll define these probabilities in a transition function, like this:

                    T(s, a, s') = p(s'|s, a)

                - ...where "s" is the current state, "a" is the action we tried to take, and "s'" is the resulting state we're interested in ending up in
            - So, if our Roomba is in the left cell and tries to move right, then there are the following probabilities for ending up in the Left/Right cell:

                    T(Left, MOVE-RIGHT, Right) = 0.9
                    T(Left, MOVE-RIGHT, Left)  = 0.1

                - If our Roomba INSTEAD tries to move up, into a wall, then there's:

                    T(Left, MOVE-UP, Left)  = 0.95
                    T(Left, MOVE-UP, Right) = 0.05

                - ...since there's a 90% chance of moving straight and hitting a wall, a 5% chance of moving left and also hitting a wall (staying in same cell), and a 5% chance of moving right
            - If we define these probabilities for ALL possible combinations of input states, actions, and output states, then we've got a complete transition model!
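            - As a rough sketch (the names here are purely illustrative), this transition model could be written in Python as a table of probabilities plus a tiny lookup function; only the entries from the examples above are filled in:

                    STATES  = ["Left", "Right"]
                    ACTIONS = ["MOVE-LEFT", "MOVE-RIGHT", "MOVE-UP", "CLEAN"]

                    # T_TABLE[(s, a)] maps each possible next state s' to p(s'|s, a);
                    # a complete model needs an entry for every (state, action) pair
                    T_TABLE = {
                        ("Left", "MOVE-RIGHT"): {"Right": 0.9, "Left": 0.1},
                        ("Left", "MOVE-UP"):    {"Left": 0.95, "Right": 0.05},
                        # ... remaining (state, action) pairs go here ...
                    }

                    def T(s, a, s_prime):
                        """T(s, a, s') = p(s'|s, a)"""
                        return T_TABLE.get((s, a), {}).get(s_prime, 0.0)
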
    - Now, if we plan a sequence of optimal actions ahead of time and rigidly follow that, there's a good chance that we'll make a mistake and move somewhere wrong at some point, and then executing the rest of our plan doesn't make any sense!
        - So, instead of a PLAN (a pre-set sequence of actions), we'll try using a POLICY (a mapping of states we could be in to actions)
            - e.g. "When in state (Left, not_dusty), move right to check for dirt"
        - How do we come up with this policy? Well, it'll depend on the "reward function" we use
            - A REWARD FUNCTION is a function that gives each possible state a real-numbered "score" for how good it is
                - Sometimes, it might also make sense to give a reward for taking an action in a given state (i.e. R(s,a) instead of R(s)); the math works out to be the exact same (and you can do the same thing by giving rewards for a resulting state), but some people have a preference
            - Let's say we have a map we want our Roomba to move through, and we want to give it a +1 reward for reaching its charger and a -1 reward for falling down the stairs
                - By default, an interesting thing happens: the Roomba doesn't have any encouragement to reach the charger quickly, so it'll try moving TOWARDS A WALL when it's next to the stairs, since that has a 0% chance of falling down the stairs and still a 5% chance of moving towards the charger
                - If we change all the non-goal/death states to be -0.1, though, now the Roomba is being told it's BAD to waste time (every extra step costs it something), so it'll start taking risks and trying to move straight past the stairs!
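                - A tiny sketch of what that reward function might look like in code (the state labels here are made up):

                    def R(s):
                        if s == "CHARGER":   # reached the charger
                            return 1.0
                        if s == "STAIRS":    # fell down the stairs
                            return -1.0
                        return -0.1          # small per-step penalty everywhere else
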
        - Once we've got this reward function, how do we compute the best policy? I'm glad you asked!
                - At a high level, we want to use the probabilities we have in our transition model AND the reward values to figure out the "utility" of each state; we'll then pick the policy that always moves us toward the states with the highest utility!
                - The UTILITY is just the expected long-term reward of being in a state, with future states discounted for being far in the future
                    - How can we define this mathematically for each state?
                        - We could just try adding up the rewards of all the future states we can reach in "T" steps, but how large should T be?
                            - We could make this infinitely far in the future, but that's unbounded! EVERY policy that eventually gets us to the goal will have an infinite score, so we won't have a way of choosing the best one!
                        - To fix the infinite problem, we could multiply each reward at time "t" by "y^t", where the discount factor "y" satisfies 0 < y < 1
                            - This means that states far in the future won't count for as much, so faster plans are now scored better - and since each step's reward is bounded, the whole infinite sum is now bounded too (at most R_max/(1 - y), a geometric series)!
                        - HOWEVER, right now the state we can end up in is a random variable; we don't know which state we'll be in after "T" steps!
                            - To fix THIS, we'll use the EXPECTED VALUE of this previous equation!
                    - So, for a given state "s", the utility of the best policy will be:

                            U_bestpol(s) = Expectation( sum{y^t * R(s_t), t=0 to inf} )

                        - This definition of utility actually more-or-less captures how most people think of rewards: bigger rewards are better, and rewards now are better than rewards later
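                        - As a quick numeric sanity check (purely illustrative, with y = 0.9): three -0.1 steps followed by the +1 charger reward come out to roughly 0.458:

                            def discounted_return(rewards, gamma=0.9):
                                # sum of y^t * R(s_t) over one sampled trajectory
                                return sum((gamma ** t) * r for t, r in enumerate(rewards))

                            print(discounted_return([-0.1, -0.1, -0.1, 1.0]))   # ~0.458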

- So, that's the utility we're using to decide this "decision problem" we're tackling - but why do we call them MARKOV decision problems?
    - To calculate utility in this way, there's actually one more assumption we're making: that the chance of us ending up in the next state is ONLY dependent on the current state, and not the past history of how we got there
        - This is called the MARKOV assumption, and it greatly simplifies our calculations
            - ...although it is NOT always true; there are problems where our history of actions is VERY important
                - Stocks are actually a HUGE example of this; if you signed a contract 2 years ago with Mr. McMoneyFace, and that contract is about to expire, that's important!
                - However, this can still give us "good enough" performance
    - This in hand, we can define the optimal policy like so:

            Policy(s) = max_scoring_action( sum{p(s'|s,a) * U_bestpol(s'), future_states})

        - This definition seems kinda circular (why? Because the policy is defined using U_bestpol, but U_bestpol is the utility of following that very same best policy!)
    - How do we actually calculate utility, then? We can do this:

            U_bestpol(s) = R(s) + y*max_action(sum{p(s'|s,a) * U_bestpol(s'), future_states})

        - This *looks* circular, but it's actually just a recursive definition for each state!
            - To get the utility of a state, we take the reward of being in that state, then add the discounted, probability-weighted utilities of all the states we could land in by taking the best action
        - This is called the BELLMAN EQUATION (applying it over and over as an update rule is the "Bellman update"), and it's important for many reinforcement learning algorithms
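        - Putting this together, here's a rough sketch of repeatedly applying that update ("value iteration") and then reading a policy off the resulting utilities; it assumes STATES/ACTIONS, a transition table with an entry for every (state, action) pair, and an R(s) like the earlier sketches:

                def value_iteration(states, actions, T_table, R, gamma=0.9, iters=100):
                    U = {s: 0.0 for s in states}              # start every utility at 0
                    for _ in range(iters):
                        # one Bellman update per state, using last iteration's U
                        U = {s: R(s) + gamma * max(
                                 sum(p * U[s2] for s2, p in T_table[(s, a)].items())
                                 for a in actions)
                             for s in states}
                    return U

                def best_policy(states, actions, T_table, U):
                    # Policy(s) = the action whose expected next-state utility is highest
                    return {s: max(actions,
                                   key=lambda a: sum(p * U[s2]
                                                     for s2, p in T_table[(s, a)].items()))
                            for s in states}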

- Where did this equation come from, and what can we do with it? That's something Professor Byrd'll cover when he gets back! Bye!