//****************************************************************************//
//************** Introduction to Bayes' Nets - June 7th, 2018 ****************//
//****************************************************************************//

- 2:36: No instructor has entered...has the Byrd migrated for the day?
    - (...the awfulness of this joke has killed 11 neurons and counting; so long as it stands undeleted, it must be considered at large and dangerous to all sentient creatures)
- 2:39: Still nothing...WAIT, NEVERMIND, THE BYRD HAS ARRIVED!
- Hurriedly scratched on the whiteboard, we find:
    "BAYES' NETS
     DYNAMIC BAYES' NETS"
- "Apologies for being late, folks; sometimes life has a mind of its own"
- A few quick announcements, and then we'll get right to Bayes' Nets - "it's like PacMan but sexier":
    - For the first homework, there's a mean grade of 17.5 and a median of 20; 9% had a grade greater than 20, 41% had exactly 20, 26% got 18-19, and 24% got less than that
    - After all those announcements, 1 person STILL used mazeDistance as their heuristic, so that person got some points off
    - Several people did get detected for plagiarism and scored a 0
        - "I'm not kidding when I say I'm a machine learning nerd, and I keep a repository of everything I've been able to find on GitHub"
    - This new homework for Bayes' Nets has a LOT of extra credit - you can get 150% on this assignment, so take advantage of that!
        - "I may slightly delay this homework, since it looks like we won't wrap up Bayes' Nets until next week - in the meantime, PLEASE get started, since this is still a hard problem"
    - Lastly, I posted some notes from last lecture on Piazza, since I realized I went a little fast towards the end there

---------------------------------------------

- So, last time, we computed a bunch of conditional probabilities for the Wumpus problem
- Today, we'll use Bayes' Nets, which are basically a way of visually computing conditional probabilities
    - So, how do they work, and why should we care?
- With the FJD (Full Joint Distribution), we can compute anything we want - but the PROBLEM with it is that it grows exponentially as we add more variables; past ~30 variables or so, it becomes unrealistic to compute in real time
    - It also has a practical issue: it doesn't play well with data sampled from the real world, since we need each EXACT combination of variable values to actually show up in our samples, and it might take a very long time to get data for multiple rare combinations at once
- With the Wumpus problem, we were trying to identify independent and conditionally independent variables to simplify our calculations; done properly, this can reduce the runtime growth to LINEAR as we add variables, which is much, much better!
- So, remember Bayes' Rule?

    P(A|B) = P(B|A) * P(A) / P(B)

    - Usually, we pull out P(B) and just fold it into a normalization constant alpha, since it works out to a constant
- There's also the Product/Chain Rule: P(A,B) = P(A|B) * P(B)
- What we WANT to do is figure out "given all this evidence, what's the most likely cause?"
    - In order to do this, we'll basically be repeatedly applying the Chain Rule (e.g. P(A,B,C) = P(B|C,A) * P(C|A) * P(A))
- Believe it or not, this leads to the GENERALIZED BAYES MODEL; essentially, it says this:

    P(Cause|effect1,...,effectn) = alpha * P(Cause) * P(eff1|Cause) * ... * P(effn|Cause)

- In other words, "What's the probability of some cause given all the effects we've seen so far? It's (proportional to) the unconditional probability of the cause times the probability of each effect given that cause"
- In order to DO this, though, we've made an implicit assumption: that ALL of these effects are independent from one another (given the cause)!
    - As you might imagine, we can NOT always assume this; two breezes we detected in different squares might be caused by the same pit, and thus related!
    - "Because of that, this is sometimes also called the 'Naive Bayes Model'"
- So, this model is going to give us inaccurate results...but is it still useful? In a word: YES!
    - At the cost of our answers being a bit off, we get our computations down to linear growth, and many times these probabilities will still be "good enough" for us to act correctly more often than not
    - So, even though this model is mathematically wrong, it's computationally practical to just assume all our variables are independent anyway
    - By making this assumption, we've gone from O(2^n) growth to O(n) - that's HUGE! (a minimal sketch of this model in code is below)
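- To make that concrete, here's a minimal naive-Bayes sketch in Python (my own sketch, NOT from lecture; the cavity numbers below are hypothetical placeholders):

    # Naive Bayes inference: P(Cause | effects) is proportional to
    # P(Cause) * the product over observed effects of P(effect | Cause).
    # Linear in the number of effects - no joint table needed.
    def naive_bayes(prior, likelihoods, observed):
        """prior: {cause_value: P(cause_value)}
           likelihoods: {cause_value: {effect: P(effect=True | cause_value)}}
           observed: {effect: True/False}"""
        unnormalized = {}
        for cause, p_cause in prior.items():
            p = p_cause
            for effect, seen in observed.items():
                p_eff = likelihoods[cause][effect]
                p *= p_eff if seen else (1 - p_eff)   # boolean effects
            unnormalized[cause] = p
        alpha = 1 / sum(unnormalized.values())        # normalization constant
        return {cause: alpha * p for cause, p in unnormalized.items()}

    # Toy cavity example (hypothetical numbers):
    prior = {True: 0.2, False: 0.8}                          # P(cavity)
    likelihoods = {True:  {"toothache": 0.6, "catch": 0.9},  # P(eff | cavity)
                   False: {"toothache": 0.1, "catch": 0.2}}  # P(eff | !cavity)
    print(naive_bayes(prior, likelihoods, {"toothache": True, "catch": True}))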
- So, making this "naive" assumption, we can now squeeze some use out of it with a Bayes' Net
- What's that? A BAYES' NET is a "directed, acyclic graph (DAG) representing random variables and their probabilities"
    - Remember, directed means our connections between nodes are 1-way only, and acyclic means that once we go down a path, there's no way to get back to a previous point
    - "How is an acyclic graph different from a tree? In a tree, our leaf nodes CANNOT be connected to one another - the 'branches' of the tree can't connect, so it's a stricter requirement. In a DAG, we can have one node point to another as long as it doesn't form a cycle"
- The "nodes" of our Bayes' Net are the random variables, and the edges are the conditional probabilities, with the edges pointing FROM the causes TO the "results"
- Once again, the edges are PROBABILITIES, not absolute "true/false" relationships; if we had an "age" node with an edge pointing to "dead battery", not all old batteries are dead - but age increases the likelihood
- Think of our Cavity problem: the cavity is what causes the toothache/catching, so we can have Cavity be the "parent" node pointing towards the toothache/catch (this little net is written out in code below)

        Toothache <----- Cavity -----> Catch

    - Since there's no directed edge pointing towards the cavity, we can see that Cavities cause Toothaches, but not the other way around!
    - Keep in mind that they can STILL affect each other's probabilities - if I have a toothache, I'm more likely to have a cavity - but the toothache does not CAUSE the cavity; the cavity DOES cause the toothache
- We can ALSO see that, given the value of Cavity, Toothache/Catch are conditionally independent - if we didn't know about the cavity, it would SEEM they were related, but since they're both children of Cavity and don't have a direct link between them, they're conditionally independent!
- In a Bayes' Net, fully independent variables will be nodes that aren't connected to any other nodes
- A classic example of a Bayes' Net is a map representing "What's wrong with my car?"; it looks like a massive flowchart for figuring that out - but it's a Bayes' Net!
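- Here's one common way a Bayes' Net could be represented in code, using the cavity example (my own sketch - the representation is an assumption, and the cavity numbers are hypothetical placeholders):

    # Each node stores its parents plus a CPT mapping tuples of parent
    # values to P(node=True | those parent values).
    cavity_net = {
        "Cavity":    {"parents": [],         "cpt": {(): 0.2}},
        "Toothache": {"parents": ["Cavity"], "cpt": {(True,): 0.6, (False,): 0.1}},
        "Catch":     {"parents": ["Cavity"], "cpt": {(True,): 0.9, (False,): 0.2}},
    }

    def prob(net, var, value, assignment):
        """Look up P(var=value | parents(var)) under a full assignment dict.
        Since variables here are boolean, P(False|...) = 1 - P(True|...)."""
        parent_vals = tuple(assignment[p] for p in net[var]["parents"])
        p_true = net[var]["cpt"][parent_vals]
        return p_true if value else 1 - p_true

    print(prob(cavity_net, "Toothache", True, {"Cavity": True}))   # 0.6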
- Let's say we have a graph like the cavity example, but for cancer:

        Test 1 <----- Cancer -----> Test 2

    - In this case, if I know I have cancer, then that affects the results of Test 1 and Test 2
    - If I DON'T know whether I have cancer, then T1/T2 are no longer independent; they affect one another (once again, conditional independence at work)
- A CAUSE is typically something that we can't observe directly, while effects are things we CAN know for sure
- We can have multi-level Bayes' Nets, where there's an overarching cause that we try to infer from several smaller, uncertain causes
- Again, to be clear, because of conditional independence here, we know T1 is independent of T2 GIVEN Cancer - even though T1 is NOT simply independent of T2 on its own
- Okay, so that's the abstract idea of a Bayes' Net - let's try slapping some numbers on the problem
- Let's say we have a Bayes' Net for this scenario:
    - At any given time, we could have a burglar in our house
    - At any given time, we could also have a thunderstorm
    - BOTH of these things can set off our alarm, but aren't guaranteed to set it off
    - We've asked both our neighbors, John and Mary, to call us if they hear the alarm - but sometimes they give us false alarms
    - For the sake of the problem, let's say John is getting a bit older nowadays, so he's gotten nosy and gives us a lot of false positives - on the other hand, he also almost always calls when the alarm really does go off
    - Mary, on the other hand, likes to play loud music, so she sometimes misses the actual alarm - but when she does call, it's usually right

        Burglars        Thunderstorm
            |                |
            ----> Alarm <-----
                 --------
                 |      |
                 \/     \/
               John    Mary

- We can apply the Bayes' Model to this!

    P(x1,...,xn) = P(x1|Parents(x1)) * ... * P(xn|Parents(xn))

- So, numerically, let's get some probabilities for this:
    - P(Burglar) = 0.001
    - P(Thunder) = 0.002
    - Now, for the alarm, it has 2 parents, so we have to consider the probability for whether there's a burglar, a thunderstorm, neither, or both - so we need a CONDITIONAL PROBABILITY TABLE (CPT):

        Burglar | Thunder | P(alarm|Burglar, Thunder)
        -----------------------------------------------
          True  |  True   |   0.95
          True  |  False  |   0.94
          False |  True   |   0.29
          False |  False  |   0.001

    - Now, since this is a boolean result (the alarm is either on or off), we don't have to make a separate column for "alarm not going off", since the two would just sum to 1 - if the variable has 3 or more possibilities, though (e.g. the roll of a die, the channel of a TV, etc.), then we'd need a column for each possible outcome of the variable
    - John has just 1 direct parent (the alarm), so the only thing we have to consider for his probabilities is whether the alarm went off or not:
        - P(calls|alarm) = 0.9
        - P(calls|no alarm) = 0.05
    - For Mary, it's the same thing - just with the numbers shifted because of her loud music:
        - P(calls|alarm) = 0.7
        - P(calls|no alarm) = 0.01
- Now, notice here that we only have 10 probabilities to worry about (1 + 1 + 4 + 2 + 2) - but if we'd created a full FJD for John, Mary, Alarm, Burglars, and Thunderstorm (5 boolean variables), we would've had to worry about 2^5 = 32! So, we now need to collect less data to start getting useful information - great!
- "How do these probabilities map to the edges?" For each of the edges, the WHOLE table maps to the edge - so instead of, say, having 4 edges for each combination of burglar/thunder, we just have the single conditional probability table P(alarm|Burglar, Thunder) attached to the edges into Alarm (the full network is written out in code below)
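- To make the "only 10 probabilities" point concrete, here's the lecture's network in the same (assumed) representation as the earlier cavity sketch - the numbers are from lecture, but the encoding is my own:

    # The burglar/thunderstorm alarm network: 1 + 1 + 4 + 2 + 2 = 10 numbers.
    alarm_net = {
        "Burglar": {"parents": [], "cpt": {(): 0.001}},
        "Thunder": {"parents": [], "cpt": {(): 0.002}},
        "Alarm":   {"parents": ["Burglar", "Thunder"],
                    "cpt": {(True, True):  0.95, (True, False):  0.94,
                            (False, True): 0.29, (False, False): 0.001}},
        "John":    {"parents": ["Alarm"], "cpt": {(True,): 0.9, (False,): 0.05}},
        "Mary":    {"parents": ["Alarm"], "cpt": {(True,): 0.7, (False,): 0.01}},
    }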
- ...now, let's use this situation to answer some questions (a sketch that checks both answers in code follows at the end):
    1) What's the probability my alarm goes off when there's no burglar or thunderstorm, and Mary/John both call me?
        - Formally, we can express this as P(alarm, !burglar, !thunder, mCall, jCall)
        - Using our Bayes' Net, let's re-express this - first, let's use the independence/conditional independence expressed by our Bayes' Net graph to break this apart:

            P(a|!burg,!thunder) * P(!burg) * P(!thunder) * P(m|a) * P(j|a)

        - We already have all of these numbers! Let's just plug in:

            0.001 * 0.999 * 0.998 * 0.7 * 0.9 = 0.00063

        - In other words, there's a 0.063% chance of this happening
    2) What's the probability that there's a burglary when Mary/John both call me?
        - This is harder than the last question because some of our variables are unknown - we don't have values for them! We don't know if the alarm went off or if there's a thunderstorm
            - "It's harder for us, but the computer doesn't care if it's adding 16 numbers together - it's still peanuts computationally"
        - Formally, the question is finding this: P(burg|mCall,jCall)
        - Let's rewrite this using the product rule:

            = P(burg, mCall, jCall) / P(mCall, jCall)
            = alpha * P(burg, mCall, jCall)

        - So, let's now "sum out"/"marginalize out" thunderstorm and alarm by summing over all their possibilities:

            = alpha * Σ_thunder Σ_alarm P(burg, jCall, mCall, alarm, thunder)

        - Per the parent-child relationships in our Bayes' Net, we can factor that joint probability into each variable conditioned on its parents:

            = alpha * Σ_thunder Σ_alarm P(thunder) * P(burg) * P(alarm|burg,thunder) * P(mCall|alarm) * P(jCall|alarm)

        - Then, we just sum this up over the possible values of alarm (on/off) and thunderstorm (true/false), plugging in the appropriate values (all of which we have for all combinations of thunderstorm/alarm)
        - So, while this is ANNOYING because we have to sum up a lot of values, it's not complicated once you get to the formula - it's just plugging in numbers
        - So, adding all this up, we get 0.00059
            - This is NOT your final probability - we have to normalize it!
            - To do that, we also have to calculate P(!burg, mCall, jCall) the same way - which, REMEMBER, isn't just 1 minus what we got
        - The final answer works out to ~28% if both Mary and John call us
        - Why is this so low? Because burglaries just aren't that common! Even with the chance of it elevated by this extra evidence, there's still a good chance that the calls were caused by either a thunderstorm or the alarm going off randomly
- "One great thing about Bayes' Nets: just like with generalized search, it doesn't need to understand the context behind what it's doing - just give it data, and it'll work! With Bayes' Nets, you just put in probabilities and connections, and it'll give you the answer"
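- Here's a minimal inference-by-enumeration sketch that reproduces both answers above, continuing the alarm_net/prob sketches from earlier (my own encoding; only the probabilities come from lecture):

    from itertools import product

    def enumerate_joint(net, fixed):
        """Sum the full joint over all variables NOT in `fixed`, where
        P(assignment) = product over vars of P(var | parents(var))."""
        hidden = [v for v in net if v not in fixed]
        total = 0.0
        for values in product([True, False], repeat=len(hidden)):
            assignment = dict(fixed, **dict(zip(hidden, values)))
            p = 1.0
            for var in net:
                p *= prob(net, var, assignment[var], assignment)
            total += p
        return total

    # Question 1: every variable is fixed, so this is a single product
    print(enumerate_joint(alarm_net, {"Alarm": True, "Burglar": False,
                                      "Thunder": False,
                                      "Mary": True, "John": True}))   # ~0.00063

    # Question 2: marginalize out Alarm/Thunder, then normalize against !burg
    p_b  = enumerate_joint(alarm_net, {"Burglar": True,  "Mary": True, "John": True})
    p_nb = enumerate_joint(alarm_net, {"Burglar": False, "Mary": True, "John": True})
    print(p_b)                  # ~0.00059, the unnormalized value from lecture
    print(p_b / (p_b + p_nb))   # ~0.28, the normalized answer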
- Now, what's missing from these networks so far? They're STATIC - the probabilities never change over time, no new information is gathered, etc.
- It's entirely possible that our random variables might not even exist at the same time - for instance, if you're trying to diagnose diabetes, blood sugar can change because of a number of things: diet, exercise, insulin levels, etc.
    - It's tough to accurately measure blood sugar in the first place, and knowing the blood sugar at any one point doesn't tell us much about what's causing that current blood sugar level
    - If we take readings at regular intervals, though, we might be able to identify a trend: is our blood sugar going up, or down?
- Another example: if we have a robot, we might have a random variable for our battery's charge, another variable for velocity, and another for position
    - These all influence each other! Our battery voltage doesn't just change to a random value - it's related to the previous charge reading. We also have "cross relationships" - maybe a higher charge gives us a better velocity, and a higher velocity will (of course) cause our position to change faster
- DYNAMIC BAYES' NETS will allow us to deal with these cause-and-effect relationships over time - and we'll get to those next week (a tiny structural preview is sketched below)
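- Just as a structural preview (my own sketch, NOT from lecture, with no numbers since none were given): the robot example suggests each time slice's variables depend on the previous slice, which we could encode with time-indexed nodes:

    # Hypothetical edge structure for one slice of the robot example:
    # the battery depends on its previous reading, velocity on the current
    # charge, and position on the previous position plus current velocity.
    # CPTs are omitted - this only previews the shape of a Dynamic Bayes' Net.
    robot_dbn_slice = {
        "Battery_1":  {"parents": ["Battery_0"]},
        "Velocity_1": {"parents": ["Battery_1"]},
        "Position_1": {"parents": ["Position_0", "Velocity_1"]},
    }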