Jake's CS Notes - Intro. to AI

//****************************************************************************//
//***************** Decision Trees/Review- June 21st, 2018 ******************//
//**************************************************************************//

- Wait, did I forget - is it the first day of summer?
    - *one Google search later* IT IS! Solstice day is upon us, the longest bright!
- *shudder* no agenda...NO AGENDA...THE WHITEBOARD IS BLANK!
    - (this is significantly less important than I think)
- The midterm is on Tuesday - closed book, closed notes as you'd expect, so look over the topic guide, study hard, etc.
-----------------------------------------------------------

- So, we gave an example decision tree last time, and using them is pretty straightforward - we give a bunch of input parameters, follow the tree according to what they are, and whatever leaf we end up at is our answer
    - But how do we actually construct our tree?
        - Well, we could take our first parameter and give it a child for EACH possible option
            - Keep in mind that we can ALWAYS turn a multi-child tree into a binary tree (by splitting each child into a new branching point); this is useful if we only have an algorithm that works on binary trees, or something similar
    - Now, exploring all possibilities for which parameters should go in which order would take too long - so, given a data set that we KNOW the answers for, let's just greedily put the paramters (the "question nodes") in some order
        - For a restaurant example, where we're deciding if we want to actually bother waiting to eat at the restaurant, our first node that we choose could be "food," with 4 restaurant food types as children nodes: "French," "Thai," "Italian," and "Burger"
            - Now, given our dataset, how do we know if this was a good variable to start with?
                - Well, we WANT to have to ask as few questions as possible - in other words, we want to get to an answer as FAST as we can
                - So, let's look at each of the food types and (from the dataset) see how many restaurants in each category we'll eventually wait for or not wait for:

                    French: 1 No, 1 Yes
                    Italian: 1 No, 1 Yes
                    Thai: 2 No, 2 Yes
                    Burger: 2 No, 2 Yes

                - This 50-50 split is actually BAD - we ideally want all the variables to be all "yes" or all "no" - since then, when we reach that food type, we can just assume the answer's a "yes" or a "no"!
            - So, we want to find a split that will maximize the number of variables that are definitely in one category
                - There will probably be cases, though, where no variable has all its children COMPLETELY in just one answer category; to get around this:
                    - If the child is above a certain threshold of "almost all 1 category" (e.g. 5 Nos and 1 Yes), we'll just say that's good enough and consider it a "No" leaf 
                        - "If the leaf isn't unanimous, just take a majority vote"
                    - Otherwise, if it's still pretty evenly split, then we'll just split that node again and ask another question
            - Now, what happens if the tree doesn't have our answer?
                - Well, we can work our way back up the tree until we find a parent that CAN give us a "majority vote" one way or another; if we end up going all the way back to the root without finding a node that can "vote" for us, we'll assume that the root node has a default answer to use in that circumstance (e.g., if we're just not sure if it's worth waiting, don't wait)

- So, those are some good guidelines for creating a shallow/efficient decision tree, but we still need to choose all these things - how do we come up with an algorithm for splitting on the "best node?"
    - Believe it or not, there are many different algorithms for doing this, partially because people disagree about what makes a question node "best"
        - Either way, our GOAL is an algorithm that will get us all leaves - or at least to get as close as possible
    - So, formally, we want to maximize our "information gain" for each level of the tree - we want to minimize our uncertainty for each question we ask

- To help with this, we'll need to bring in a little bit of INFORMATION THEORY (don't worry, you'll get more of this in other classes)
    - Now, let's say that information is predicting the outcome of an experiment
        - So, flipping a 1-sided coin (use your imagination) doesn't give us any information for predicting the next flip - however, if the coin is two-sided, it DOES give us a bit of information for finding the distribution of the coin
    - Let's say we flip a coin 6 times, and get the following outcomes:

            T, H, H, H, T, T

        - By doing this, we've generated 6 "bits" of information
            - As a sidenote, these bits are also the units of "entropy" - which is technically the amount of uncertainty in a problem, but that's outside this class's scope
        - Now, if the coin is fair, we've gained 1 bit of information per flip, since we had no way of knowing what the outcome would be
            - If the coin was COMPLETELY unfair and always gave the same result, we'd have gained 0 bits of information - we already knew what would happen!
            - Anything in-between, though, would give us between 0 and 1 bits of information - we didn't know for sure what the answer was, but we had some guess
        - How do we figure out how much information we gain?
            - Well, if there are "N" possible outcomes v1...vn, and the probability of each is between 1 and N, then we'll say:

                I(P(v1), ..., P(vn)) = {-1 * P(vi) * log2(P(vi)), n}

            - "The base of the log doesn't really matter that much"
            - Let's check to see if this makes intuitive sense for a fair coin:

                I(1/2, 1/2) = -1/2 * log2(1/2) + -1/2 * log2(1/2)
                            = -1/2 * -1 + -1/2 * -1
                            = 1/2 + 1/2
                            = 1

            - Let's see if it works for our 1-sided, completely unfair coin:

                I(1) = -1 log2(1) = -1 * 0 = 0

            - Now, what about a coin that comes up heads 1% of the time, and tails 99% of the time?

                I(1/100, 99/100) = -1/100 * log2(1/100) + -99/100 * log2(99/100)
                                = (...calculator math later...)
                                = 0.08

                - So, because it only gives us 0.08 "bits" of information, we'd have to flip this unfair coin a LOT more times than the fair one to start getting new information - to learn something I didn't know before
    - "Now, information theory isn't exactly barrels of fu...well, it's enlightening and useful, even if you don't appreciate it as you go through it"

- So, if we can figure out for our dataset how much information each answer/question gives us, we can then figure out which variable gives us the most information towards our answer - which is what we want!
    - Sticking with our boolean "yes/no" restaurant example, let's define 2 variables "P" and "N"

            P = # of positive examples ("I will wait here")
            N = # of negative examples ("I won't wait here")

    - So, given that we KNOW P and N for a question, then our probability for a "yes" is P/(P+N), and for a "no" is N/(P+N)
        - Figuring out the general entropy equation for this, then:

            I(P/(P+N), N/(P+N)) = 
                    -P/(P+N) * log2(P/(P+N)) + -N/(P+N) * log2(N/(P+N))

    - So, given this equation, we want to choose the question/variable that will give us the HIGHEST value of I - that will give us the most information towards our answer
        - How do we figure this out? Well, by iterating over each of the variables we could use and checking how much information we'd get
            - Keep in mind that, if we're choosing a variable that isn't the root question (2nd level, 3rd, etc.), the probability would be the probability GIVEN the variables we've already chosen
--------------------------------------------------------------------

- NOW, we'll come back to that next week - but, with the midterm on Tuesday, I'm assuming most of you want to review for that
- ...so, let's review what we've learned here since we wandered in at May
    - These questions are taken from the review midterm stuff I posted on Canvas

    - The first one is a static Bayes' Net question with nodes "B","I","M","G","J"
        - "I swear, this problem was created in the 1990s during Bill Clinton's presidency - any similarities to persons President today is unintended"
        - Now, for question a)
            - The first answer would suggest that B,I, and M are all independent - which isn't true, since B and M both have I as a child, so I depends on B and M
            - The second statement is TRUE - since G is a child of I, and J is a child of G, P(J|G) = P(J|G,I), as J is only directly dependent on G, not on I
            - This is TRUE - P(M|G,B,I) = P(M|G,B,I,J), since M is independent of J

        - For question b)
            - We're just calculating ONE joint probability (in this case, P(b, i, not m, g, j))
                - "These kinds of problems are pretty common, so you should definitely know how to do them"
            - To make this easier, we can split this up using independence/conditional independence:

                = P(b)P(i|b,m)P(not m)P(g|b,i,not m)P(j|g)

            - "We have ALL these probabilities in the JDT , so we just need to plug in"

                = 0.9 * (0.5) * (0.9) * (0.8) * (0.9)
                = 0.2916

        - For question C)
            - Okay, so we have a word problem: "calculate the probability that someone goes to jail GIVEN that they broke the law, face a politically motivated processor, and have been indicted"
                - Simply translated, this is P(j|b,i,m)
            - Now, you could waste a LOT of time on this if you do all the cases - but we can marginalize out G!
                - Therefore, we JUST have to sum over 2 cases: "G" and "not G"

                = a * {P(j,g)}
                = a * (P(j,g) + P(j, not g))
            - So, calculating the first one:
                P(j,g) = P(j|g) * P(g|b,i,m) = 0.9 * 0.9 = 0.81
            - The second one:
                P(j, not g) = P(j| not g) * P(not g|b,i,m) = 0 * 0.1
            - Thus, the total probability is:
                a * (0.81 + 0) = 0.81a
            - We COULD try to calculate the probability of NOT going to jail given this stuff to figure out A, but since we KNOW there are only 2 possibilities, we know that'd be 1 - 0.81 = 0.19 - so, the final answer IS just 0.81 

        - For question d)
            - Quick definition here: a "context-specific independence"

        - For question e)
            - So, if we want to add a new variable "P" representing a presidential pardon, how would the net change? In other words, what would it reasonably affect and why?
                - ...well, it would probably just affect "J" - the probability of going to jail 
            - And what, in turn, might reasonably affect "P"?
                - Well, "I" - being indicted - and "G" - if the person was found guilty - would both DEFINITELY affect P and be parents of it, since you can't pardon someone who wasn't found guilty and was never indicted
                    - The other two arguably could affect it, too, but they aren't required

    - The 2nd question's a Search problem: given a graph A, B, C, D, G, and S
        - "Again, this is a problem where you can waste a LOT of time if you don't know how to properly do things"
        - For a)
            - Breadth-first search would just return "S,G" - you start at S, and S is adjacent to G, so it would just hop straight to G!
            - Remember, BFS doesn't factor in edge costs
        - For b)
            - So, we need to give the visited list and solution for UCS: starting from "S," then:

                Open: (0,S,-)x (1,A,S)x (12,G,S - too high cost, look elsewhere) (2,C,SA)x (3,D,SAC)x (4,B,SAC)x (4,G,SAC)x (6,G,SACD) (7,D,SAB)
                Closed: S,A,C,D,B,G

                Solution: SACG
        - For c)
            - Giving the visited list/solution for DFS, we assume that we add things to the stack in alphabetical order
                - Do we add duplicates to the open list? Either one's fine, but make sure you're consistent

                Solution: SACG
        - For d)
            - Since UCS always returns the optimal path, we ALREADY know that A* with a consistent heuristic will ALSO return the same path as UCS - namely, SACG
        - For e)
            - Considering the heuristics, are they admissible/consistent?
                - For h1, it is NOT admissible, since h1(s) overestimates the distance to the goal for S (we know from UCS that the shortest path from S to the goal is 4, but the heuristic guessed 5)
                    - h1 also CAN'T consistent, since it isn't admissible
                - For h2, it is admissible, but it is NOT consistent since the heurstic estimates the distance between S and A is 2, when it's actually 1
        - "That covers most of what we would cover on searching problems"

    - The 3rd review question is on dynamic Bayes' Nets, given a 3x3 grid 
        - For a)
            - The X's should be connected to each other - they clearly affect one another!
                - We could NOT take an action that would affect both X and Y in 1 step, so X/Y should stay independent
            - Both X and Y should be connected to the sensor readings in the same timestep - as our position affects what our sensor reading is!
            - Connect the action we take to the X/Y for the next step
                - "Notice that our action does NOT directly affect the sensor, but indirectly influences it through our position"
        - For b)
            - The easiest way to start this problem is by drawing our world and thinking where we could start
                - Since N1=1, we start next to 1 wall - so we can't start in the corners or in the middle of the 3x3 grid
                - We then try to move right (a1 = right) and n2=2 - so our move WAS succesful, so we couldn't have moved into a wall
                - ...and since n2 = 2, we must've moved into a corner
            - Therefore, the only 2 positions consistent with this are the top-middle and bottom-middle!
        - For c)
            - So, we'll never got false positives, but we might get false negatives - in other words, the number of walls our sensor gives is either equal to or less than the actual number of walls
                - There's a chance it'll "miss" each wall around it
            - Now, N2 = 1 - there's at LEAST 1 wall next to it when it finishes
                - Therefore, it can be anywhere except the center
            - We got there by taking "action right" - so, we could NOT have been in the left-center at the start, since moving from there would've moved us into the center!
            - We also could NOT have been in the middle, since N1=1, but there atre no walls in the center
            - Therefore, we couldn't have been in EITHER left-center or the center to start!
        - For d)
            - If we keep moving right and down, we can guarantee that we're in the bottom-right corner since we know where the walls are - so, it's FALSE that we can't know for certain where we are! We can coerce the environment by moving ourselves!