//****************************************************************************//
//******************* Input Analysis - March 14th, 2019 *********************//
//**************************************************************************//

- It is still PI DAY!
    - ...yet I have consumed a conspicuously small amount of pie (and pi; come to think of it, the number of abstract mathematical constants I have consumed is FRIGHTENINGLY infinitesimal D:)
--------------------------------------------------------------------------

- Alright, last time we talked about random number generation, and using that uniform RNG function to model random distributions - neat!
    - But how do we USE these random numbers? Why should we care about them as simulation-makers? That's what we'll be diving into today!
        - In particular, we'll be talking about "input analysis," and what randomness and distributions have to do with it
            - ...if nothing else, you should care because this is a decent chunk of your 2nd project

- Before we get down to this, though, let's talk about some basic assumptions we're making:
    - Most discrete event simulations are STOCHASTIC, involving some element of randomness
        - This randomness is brought in through "stochastic processes," which are a sequence of random variables (commonly ordered by time)
            - For instance, we might have a list of the arrival times of cars entering a gas station, each of which'll be different from run to run
        - A STATIONARY stochastic process is one where the distribution of the random variables never changes over time
            - This is NOT true for many physical systems (for instance, traffic distributions aren't constant - it's going to decrease when it's late at night!)
            - If we're looking at this system for a limited window of time, though, we might still reasonably assume the process is stationary
        - We'll also often assume that the process is INDEPENDENT (or AUTONOMOUS), meaning it doesn't depend on the behavior of the system itself
            - For instance, the arrival time of each car probably doesn't depend on the cars that came before it or on the current state of the system, so we can keep using the same arrival distribution throughout the run - but sometimes the process DOES depend on the system!
                - If you pass a Starbucks with a long line, for example, you might choose not to go there because the line was long!
                - Similarly, if you were modeling a line at a sandwich shop, the employee at the counter might start speeding up if the line gets long, meaning the wait times are now DEPENDENT on the line's length!
                    - Dependent stochastic processes mean that our random variables are now related to each other, and means that our simulations get more complicated
    - We'll often try and deal with HOMOGENEOUS stochastic processes: processes that are stationary, with random variables that are independent AND "identically distributed" (i.e. all the customers are arriving based on the SAME distribution curve)
        - This might not *always* be realistic, but it's widely used to model the inputs to a system (like the arrival of customers) - see the sketch right after this list for what that looks like
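
- (Not from the lecture - just a minimal Python/NumPy sketch of what a homogeneous arrival process looks like in code: every interarrival gap is drawn independently from the SAME distribution, never looking at the state of the system. The 2-minute mean is an assumed example value.)

        # Sketch: i.i.d. exponential interarrival gaps = a homogeneous arrival process
        import numpy as np

        rng = np.random.default_rng(42)     # seeded uniform RNG under the hood, like last lecture
        MEAN_GAP = 2.0                      # assumed: 2 minutes between cars, on average

        def homogeneous_arrivals(n_customers):
            """Arrival times built from independent, identically distributed gaps."""
            gaps = rng.exponential(MEAN_GAP, size=n_customers)
            return np.cumsum(gaps)          # arrival time = running sum of the gaps

        print(homogeneous_arrivals(5))      # five increasing arrival times (in minutes)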

- All of this leads us to the idea of INPUT MODELING: trying to define the processes that govern our system's inputs, and turn those into models
    - There are 3 general approaches to doing this:
        - TRACE-DRIVEN input processes are where we feed the recorded data itself directly into the simulation, replaying the observed values rather than sampling from a fitted distribution
            - This is commonly used when we don't have a good idea of the underlying model behind our observations, e.g. when testing a new computer architecture
        - Selecting a "STANDARD DISTRIBUTION", and then trying to fit it as well as we can to the data
        - Constructing an EMPIRICAL DISTRIBUTION based off of the data we have
    - Throughout this whole thing, make sure you're aware of what independence assumptions you're making!

- From this, our goal is to create a distribution that "models an autonomous homogeneous stochastic process from the data"
    - To do this, we have to ask two big questions:
        - Do we even HAVE a homogeneous stochastic process?
            - We'll try to answer this by applying correlation tests for independence, and testing for stationary behavior
        - How do we fit a distribution to this data, anyway?
            - Often, we'll create a histogram of our data, select an existing distribution that seems to fit, and tweak it until it works well
            - We might also create a brand-new empirical distribution, if the canonical stuff isn't working out
    - Also, what if we just don't have enough data to draw firm conclusions? Can we still make good decisions then?

- First up, how do we test our data for independence?
    - Surprisingly, we can get a pretty good start by just visualizing the data and seeing if it "looks" correlated
        - Given a time-ordered sequence of observations (e.g. results of a series of coin flips, hourly temperature readings, customer arrivals at a store, etc.), would the results be correlated with one another?
            - For the coin flips, no! We'd expect each flip to be independent
            - For the weather, YES! If it's warm one hour, it'll probably still be warm an hour later!
        - We'll usually measure the amount of correlation between two different sequences using the COVARIANCE, where a positive covariance means a direct relationship, negative means an inverse relationship ("if X is big, Y is small"), and close to 0 means little to no correlation
            - We'll compute this using the formula:

                    Cov(X, Y) = sum from i=1 to N {(X_i - ux) * (Y_i - uy)} / N

                - Where X_i/Y_i are the paired observations of the two random variables, and ux/uy are the sample means of X/Y
        - We can often get a pretty good idea just by plotting (X, Y) values on a scatter plot and seeing if there's a negative slope, positive slope, or no clear pattern
            - To test for correlation WITHIN a single time series, we'll plot the pairs (x_i, x_i+1) - each value against the one right after it - and see if a pattern emerges
    - What if we want to check for correlation in a single series not just between successive points, but between more distant points?
        - Well, we could do a scatter plot for "x_i, x_i+k, etc.,"
    - We could also make an AUTOCORRELATION plot, which summarizes the correlation at many different lags at once instead of needing a separate scatter plot for every k
        - Here, we plot the "autocorrelation" between x_i and x_(i+k) as a function of k, calculated like this:

                autocor(k) = sum from i=1 to N-k {(x_i - ux) * (x_(i+k) - ux)} / ((N-k) * var(x))

            - Where ux and var(x) are the SAMPLE mean and variance, respectively
        - An autocorrelation near 0 indicates an independent time series, while values near +1 (or -1) indicate that each value strongly depends on the value "k" steps away
            - We'll typically plot this for a range of values of k (e.g. k = 1 to k = 30)
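
- (A minimal Python/NumPy sketch of both measures above - mine, not the slides' - computing a sample covariance between two sequences and the lag-k autocorrelation of a single series; the example data is just simulated.)

        import numpy as np

        def covariance(x, y):
            """Sample covariance: the average of (X_i - ux) * (Y_i - uy)."""
            x, y = np.asarray(x, float), np.asarray(y, float)
            return np.sum((x - x.mean()) * (y - y.mean())) / len(x)

        def autocorrelation(x, k):
            """autocor(k) = sum_{i=1}^{N-k} (x_i - ux)(x_(i+k) - ux) / ((N-k) * var(x))"""
            x = np.asarray(x, float)
            n, ux, var = len(x), x.mean(), x.var()
            return np.sum((x[:n - k] - ux) * (x[k:] - ux)) / ((n - k) * var)

        # i.i.d. samples, so the autocorrelations for k = 1..30 should hover near 0
        data = np.random.default_rng(0).exponential(2.0, size=500)
        print([round(autocorrelation(data, k), 3) for k in range(1, 31)])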

- How do we test for stationarity, then? How do we make sure our process really is stationary?
    - Well, if the process is stationary, then that means the mean value of our results shouldn't be changing over time!
        - The arrival times at a "stationary" deli on Thursday shouldn't differ from the arrival times on Monday, for example
    - To measure this, we'll take our data points, somehow group them together into different bins for different time segments, and then compute and plot the average of each bin
        - So, if we had data for our deli over a week, we could make a bin for each day, or we could instead make hourly bins for "9:00am", "10:00am," etc. (grouping together the data from different days in the same bin), depending on what we're asking
            - If the process really is stationary, there should be little change in the mean; otherwise, if there is significant change, it's not stationary!
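
- (A quick sketch of the binning idea, assuming Python/NumPy and made-up deli data: tag each observation with its hour-of-day, then compare the per-bin means.)

        import numpy as np

        rng = np.random.default_rng(1)
        hours = rng.integers(9, 18, size=1000)     # assumed: hour-of-day tag for each observation
        gaps = rng.exponential(2.0, size=1000)     # assumed: measured interarrival times (minutes)

        for h in range(9, 18):                     # one bin per opening hour
            in_bin = gaps[hours == h]
            mean = in_bin.mean() if in_bin.size else float("nan")
            print(f"{h}:00 bin -> mean gap = {mean:.2f} min")
        # Roughly equal bin means are consistent with a stationary process;
        # a clear trend (say, a lunchtime dip) means it isn't stationary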

- Alright, we've done these tests and decided our data is reasonably homogenous - great! But how can we get a distribution we can use for it, then? What's the next step?
    - We could jump straight to making an empirical distribution, but that can have a few drawbacks:
        - There's no way to modify this distribution; whatever we measured, that's what we got!
            - For instance, how do we change the distribution to represent higher levels of traffic when all the data we've got is for 4am on a Sunday?
        - It also means we're limited to the range of values we actually measured - if the maximum flight time in our data set was 6 hours, we're never going to get an accurate time for a Hong Kong to New York flight!
    - A good first approach, then, is to try and find an existing theoretical distribution 
        - This tends to be less noisy than our real-world data, and gives us more options to play with
        - To actually do this, we'll typically do the following:
            - Create a histogram of our actual data
            - Select some candidate distributions
            - Try to tweak the distribution parameters to match our histogram
            - Choose the best-fitting candidate

- Let's walk through the theoretical distribution steps a little more closely:
    - First off, creating the histogram is pretty straightforward for discrete data: we just count how many occurrences we had for each value and plot it
        - What about for continuous cases, though, where the observations are real-valued (like exact arrival times) instead of a handful of discrete values? Well, we can make them discrete!
            - Similarly to what we did in the stationarity tests, we can make "bins" covering the range of observed values, count the number of observations that fall in each bin, and then plot those counts just like the discrete case (there's a code sketch after this whole section that walks through binning, fitting, and testing)
                - If we make our bins too small, we can end up with bumpy-looking curves
                - If they're too large, then we can accidentally gloss over fine details that are important
                    - Some common suggestions are to use sqrt(n) bins, or (2n)^(1/3) bins, but these are just rules of thumb
    - Next up, after choosing some candidates, we want to tweak their parameters to match our histogram - but how do we do that?
        - One way is to look at the MAXIMUM LIKELIHOOD ESTIMATORS for the data, which exist for many common distributions and give formulas for the distribution's parameters in terms of the observed data (e.g. for the exponential distribution, the MLE of the mean is just the sample mean)
    - After doing that for all our candidates, we've gotta choose the best one - with math!
        - There're various "goodness-of-fit" and "hypothesis" tests to do this, but a particularly famous one is the CHI-SQUARED TEST
            - This is where we hypothesize that the current candidate really is the distribution that generated our observations, and then we calculate an ADEQUACY MEASURE:

                    A = sum((O_i - E_i)^2/E_i)

                - Where we're summing over each of the "k" bins we have, O_i is the observed count in the "ith" bin, and E_i is the count we'd expect in that bin under the candidate distribution we're testing
                    - Basically, we're summing up how far off the candidate's predictions are from what we actually observed, bin by bin
            - The closer the adequacy measure is to 0, the better the distribution is - but how close to 0 does it need to be?
                - As it turns out, if we have "k" bins, then A should converge to the Chi-squared distribution with k-1 degrees of freedom (minus one more for each parameter we estimated from the data)
                - Looking at that k-1 distribution, we'll choose a "level-of-significance" alpha that's our accepted probability of rejecting a true hypothesis, and then find the CRITICAL VALUE "chi*" by looking at a table
                    - If A > chi*, then a discrepancy this big would show up less than alpha of the time if the candidate really were the true distribution - so we should reject it!
        - Okay - we'll apply this test to all of our candidate distributions, and then choose the best one!
            - One rule of thumb people have suggested, though, is that there should be at LEAST 5 elements in each bin for the Chi-squared test to work
                - What if some of our bins are smaller than that? Then we can just merge adjacent bins to make them bigger!
            - (example walking through this test for the exponential/gamma distributions on the slides)
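
- (To tie the workflow together, here's a sketch in Python + NumPy/SciPy - not the slides' code, and using simulated "observations" - that histograms the data, fits an exponential candidate by maximum likelihood, and computes the Chi-squared adequacy measure against the critical value.)

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(7)
        data = rng.exponential(scale=3.0, size=400)   # stand-in for observed interarrival times
        n = len(data)

        # 1. Histogram with ~sqrt(n) bins as a rule-of-thumb starting point
        observed, edges = np.histogram(data, bins=int(np.sqrt(n)))

        # 2. Candidate + MLE: for the exponential, the MLE of the mean is the sample mean
        candidate = stats.expon(scale=data.mean())

        # 3. Expected count per bin under the candidate (merge bins if any expected count < ~5)
        expected = n * np.diff(candidate.cdf(edges))

        # 4. Adequacy measure A = sum((O_i - E_i)^2 / E_i) vs. the critical value chi*
        A = np.sum((observed - expected) ** 2 / expected)
        dof = len(observed) - 1 - 1                   # k - 1, minus one for the estimated parameter
        chi_star = stats.chi2.ppf(1 - 0.05, dof)      # alpha = 0.05
        print(f"A = {A:.2f}, chi* = {chi_star:.2f}, reject candidate: {A > chi_star}")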

- Going back to empirical distributions, let's say we've got all this data but no theoretical distribution seems to be a good fit; what should we do?
    - Well, we'll create a CDF from our data like we described yesterday, then use a random-number generator to pick from this!
        - To make this less "choppy," we'll very often linearly interpolate between our data points so that the distribution itself is continuous, and can output values other than the ones we've seen
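
- (A minimal sketch of that idea, assuming Python/NumPy and stand-in data: build the empirical CDF from the sorted observations, then sample by pushing uniform random numbers through a linearly-interpolated inverse CDF.)

        import numpy as np

        rng = np.random.default_rng(3)
        observed = np.sort(rng.gamma(2.0, 2.0, size=200))   # stand-in for our measured data

        # Empirical CDF: the i-th smallest observation gets cumulative probability i/n
        probs = np.arange(1, observed.size + 1) / observed.size

        def sample_empirical(n_samples):
            """Inverse-transform sampling, linearly interpolating between observed values."""
            u = rng.uniform(size=n_samples)     # the uniform RNG from last lecture
            return np.interp(u, probs, observed)

        print(sample_empirical(5))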

- If NO data at all is available, well, life gets harder...but we can still make some educated guesses
    - If we only know the minimum/maximum possible values, we can use the uniform distribution; if we also have some idea of what values are "likely," try using the beta distribution, or something similar!
    - We can also check out well-regarded projects related to ours and see how they handled it - if we don't have any arrival data on Milwaukee's Best Coffee Shop, maybe a published Starbucks study is a good-enough starting point
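
- (A tiny sketch of those fallbacks, assuming Python/NumPy and made-up bounds: a uniform distribution when only the min/max are known, and a triangular distribution - a simple stand-in for the beta idea - when we also have a "most likely" value.)

        import numpy as np

        rng = np.random.default_rng(5)
        lo, hi, mode = 1.0, 6.0, 2.5     # assumed min, max, and "most likely" service time (minutes)

        only_bounds_known = rng.uniform(lo, hi, size=5)            # min/max only -> uniform guess
        bounds_plus_likely = rng.triangular(lo, mode, hi, size=5)  # plus a likely value -> triangular
        print(only_bounds_known, bounds_plus_likely)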

- Alright - in summary, we often assume our simulation inputs are homogeneous, but that's a BIG assumption that still needs to be backed up
    - We can try to justify this assumption using a variety of statistical tests
    - From there, we can use a variety of procedures to parameterize our inputs into a distribution we can actually use
- Again, input analysis is a BIG research area you can sink decades into if you so wish; we're just covering the fundamentals here

- Okay, and that's all for today! We'll finish talking about project lifecycles next time, and then start trucking on to the wild world of parallel processing - until then, have a good spring break!