//****************************************************************************//
//*************** Prefix-Sum Algorithm - January 30th, 2020 *****************//
//**************************************************************************//

- "Alright, you're here early (i.e. on time) - so let's reward you by talking about some logistic stuff!"
    - To run on the cluster, you should be running it as follows:

            qsub -q coc-ice-multi -l nodes=4:core8,walltime=00:30:00 -I
            mpirun -np 16 ./prog1.out

        - When we start using multiple nodes, we need to specify a "hostfile" that says how much work we want each node to do
            - It seems when you run this, it's actually only allocating 4 cores, which means there are more threads than there are processors - and that can lead to HUGE performance overheads, especially with communication protocols
                - "Logically, you think this'd give us 4*8 = 32 cores, but it appears that's not what PACE is doing"
                - However, because PACE is a shared cluster among multiple classes, asking for more resources than are currently free may mean you only get the nodes that are currently available
        - To fix this, we can instead ask for a single node with 16 processors on it:

                qsub -q coc-ice-multi -l nodes=1:ppn=16,walltime=00:30:00 -I

            - "Going forward, I'd use this syntax; the number of nodes and the number of processors per nodes"
        - To actually send files from your computer to the cluster, you can use scp before you "ssh" into the cluster
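            - For example, from your own machine (the hostname below is just a placeholder - substitute the cluster's actual login node):

                    scp prog1.c your_gt_username@<cluster-login-node>:~/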
        - Also, note that we're generating squared distances uniformly from (0, r^2) and then taking the square root of them (e.g. x = sqrt(dist)*cos(th)) for statistical reasons; they don't cancel out - the area inside radius x grows like x^2, so drawing the radius itself uniformly would pile points up near the center, while sqrt of a uniform draw on (0, r^2) spreads them evenly over the disk (see the sketch below)
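            - A little sketch of that sampling trick (my own code, not from the lecture; C's rand() is assumed to be good enough for illustration):

                    /* Area-uniform sampling of a point in a disk of radius r: the area
                     * inside radius x grows like x^2, so we draw d ~ U(0, r^2) and use
                     * sqrt(d) as the radius; drawing the radius itself uniformly would
                     * crowd points toward the center. */
                    #include <math.h>
                    #include <stdlib.h>

                    void random_point_in_disk(double r, double *x, double *y) {
                        double d  = ((double)rand() / RAND_MAX) * r * r;            /* d  ~ U(0, r^2)  */
                        double th = ((double)rand() / RAND_MAX) * 2.0 * acos(-1.0); /* th ~ U(0, 2*pi) */
                        *x = sqrt(d) * cos(th);
                        *y = sqrt(d) * sin(th);
                    }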
--------------------------------------------------------------------------------

- Alright, let's get back to prefix sum!
    - As we started learning last time, this is where, given "x_0 ... x_n", we're trying to compute every partial sum "x_0 + ... + x_i"
    - MPI_Scan can do this in parallel for us, and not just for sums, but for ANY associative operation
        - If we can fit things into this prefix sum algorithm, we can save ourselves a lot of code and communication over the network - it might look specific and technical, but it's actually a VERY general technique
            - A lot of iterative operations can be expressed in terms of associative things, which "scan" can then compute for us
    - Last time, we talked briefly about HOW we can do this in parallel
        - Basically, each process keeps its own prefix result plus a running total of everything it has seen, and exchanges that running total with its partner in a hypercubic permutation each round; whatever it receives always gets folded into the running total to be passed on later, and it ALSO gets folded into the prefix result if it came from a lower rank
        - We can also implement this using shift communicators, which may be more intuitive but is slightly slower and can be awkward on some networks
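    - Just so the call itself is on the page, here's a minimal MPI_Scan sketch (my own code, not from the lecture):

            /* Inclusive prefix sum with MPI_Scan: afterwards, rank i holds
             * x_0 + x_1 + ... + x_i. */
            #include <mpi.h>
            #include <stdio.h>

            int main(int argc, char **argv) {
                MPI_Init(&argc, &argv);
                int rank;
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);

                int x = rank + 1;      /* each rank contributes one value */
                int prefix = 0;
                MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

                printf("rank %d: prefix sum = %d\n", rank, prefix);
                MPI_Finalize();
                return 0;
            }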

- Where would we use this? Let's look at some examples!
    - Let's say we're doing "polynomial evaluation," where we have the function:

            P(x) = a_0 + a_1*x + a_2*x^2 + (...)

        - Let's assume the coefficients "a_0 ... a_n" are distributed in even-sized blocks over our "p" processors
            - It seems like each processor would have to compute its own powers "x, x^2, etc." from scratch, which is a lot of redundant work - the later processors need REALLY high powers of x
            - INSTEAD, we'll compute these powers by building an array whose first element is "1" and whose remaining elements are all "x", with "multiplication" as our operation
                - We can then run MPI_Scan on this array, giving all the needed powers of "x" to each processor
                - Then, we can compute the sub-sums for each processor (e.g. "a_0 + a_1*x + ... a_(n/p)*x^(n/p)")
            - Finally, we can add all these partial sums together with reduction
        - Computation-wise, this means T = O(n/p + log(p)), with the communication time being O((t+u) * log(p))
            - Efficiency-wise, this means we stay efficient as long as p = O(n / log(n)) - see the sketch below
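        - Here's a rough sketch of that whole pipeline (my own code, not the lecture's; it uses MPI_Exscan with MPI_PROD for the cross-processor part of the prefix product, and dummy all-ones coefficients):

                /* Evaluate P(x) = a_0 + a_1*x + ... + a_{n-1}*x^{n-1} with the
                 * coefficients block-distributed, b = n/p per rank:
                 *   1. local prefix product        -> x^1 .. x^b on each rank
                 *   2. MPI_Exscan(PROD) of x^b     -> offset x^(rank*b)
                 *   3. local partial sum of a_j * x^j
                 *   4. MPI_Reduce(SUM) of the partial sums                     */
                #include <mpi.h>
                #include <stdio.h>

                int main(int argc, char **argv) {
                    MPI_Init(&argc, &argv);
                    int rank, p;
                    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                    MPI_Comm_size(MPI_COMM_WORLD, &p);

                    const double x = 1.01;               /* evaluation point           */
                    double a[4] = {1.0, 1.0, 1.0, 1.0};  /* this rank's block (b = 4)  */
                    const int b = 4;

                    double pow_local[4];                 /* x^1 .. x^b                 */
                    pow_local[0] = x;
                    for (int k = 1; k < b; k++) pow_local[k] = pow_local[k-1] * x;

                    double block_pow = pow_local[b-1], offset = 1.0;
                    MPI_Exscan(&block_pow, &offset, 1, MPI_DOUBLE, MPI_PROD, MPI_COMM_WORLD);
                    if (rank == 0) offset = 1.0;         /* Exscan's result is undefined on rank 0 */

                    double partial = 0.0;
                    for (int k = 0; k < b; k++) {
                        double xj = (k == 0) ? offset : offset * pow_local[k-1]; /* x^(rank*b + k) */
                        partial += a[k] * xj;
                    }

                    double result = 0.0;
                    MPI_Reduce(&partial, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
                    if (rank == 0) printf("P(%g) = %f\n", x, result);
                    MPI_Finalize();
                    return 0;
                }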
    - Let's look at an example that EVERYONE'S seen before: the Fibonacci sequence!
        - So, f_0 = f_1 = 1, and f_i = f_(i-1) + f_(i-2)
            - Or, in linear algebra land, our update is:

                    [f_i f_i-1] = [f_i-1 f_i-2] * [1 1]
                                                  [1 0]

            - We can turn this into a prefix-sum problem by noting that we have some overlapping series - each new element is dependent on the two that came before it!
            - If we look at the linear algebra version, we're accumulating these small fixed-size matrices each time we multiply - so we can do prefix-sum on 2x2 Matrix multiplication!

                    S_i = S_1 x M^(i-1)

                - This is NOT something that MPI has built-in, but luckily, we can extend it with our own operator (MPI_Op_create) to give each processor M^2, M^3, etc. - see the sketch below
            - We can generalize this to 3x3 matrices for 3-term recurrences, 4x4 for 4-term recurrences, etc. (within reason, since we're still doing matrix multiplications)
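            - As a sketch, here's what that custom scan operator could look like (my own code, not the lecture's; it assumes 64-bit longs don't overflow for the rank counts we use):

                    /* User-defined MPI_Op that multiplies 2x2 matrices, so MPI_Scan can
                     * hand rank i the power M^(i+1) of M = [1 1; 1 0].  With the class's
                     * indexing f_0 = f_1 = 1, the top-left entry of M^(i+1) is f_(i+1). */
                    #include <mpi.h>
                    #include <stdio.h>

                    /* inout = in * inout; MPI feeds "in" from the lower-ranked operand */
                    void matmul2x2(void *in, void *inout, int *len, MPI_Datatype *dt) {
                        long *a = (long *)in, *b = (long *)inout;
                        for (int i = 0; i < *len; i++, a += 4, b += 4) {
                            long c[4] = { a[0]*b[0] + a[1]*b[2], a[0]*b[1] + a[1]*b[3],
                                          a[2]*b[0] + a[3]*b[2], a[2]*b[1] + a[3]*b[3] };
                            b[0] = c[0]; b[1] = c[1]; b[2] = c[2]; b[3] = c[3];
                        }
                    }

                    int main(int argc, char **argv) {
                        MPI_Init(&argc, &argv);
                        int rank;
                        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

                        MPI_Datatype mat2x2;                 /* one 2x2 matrix of longs */
                        MPI_Type_contiguous(4, MPI_LONG, &mat2x2);
                        MPI_Type_commit(&mat2x2);

                        MPI_Op matmul_op;
                        MPI_Op_create(matmul2x2, 0 /* non-commutative in general */, &matmul_op);

                        long M[4] = {1, 1, 1, 0};            /* every rank contributes M */
                        long Mpow[4];
                        MPI_Scan(M, Mpow, 1, mat2x2, matmul_op, MPI_COMM_WORLD);

                        /* rank i now holds M^(i+1), whose top-left entry is f_(i+1) */
                        printf("rank %d: f_%d = %ld\n", rank, rank + 1, Mpow[0]);

                        MPI_Op_free(&matmul_op);
                        MPI_Type_free(&mat2x2);
                        MPI_Finalize();
                        return 0;
                    }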
    - We can ALSO do this with modular arithmetic to get parallel pseudo-random numbers, which is great!
        - e.g. a linear congruential generator's update x_(i+1) = (a*x_i + c) mod m is just the same affine map applied over and over, and composing affine maps is associative - so a prefix scan can hand each processor its own chunk of the random stream

- So, all of these examples are naturally "lines of things," where things are dependent one after the other - but we can also use prefix sums to do less obvious stuff!
    - For instance, TREE ACCUMULATION, where we're trying to compute the sums of all the subtrees below a given node (or alternatively, going above ("we need directions from our boss's boss's boss or something"))
        - Naively traversing the tree, this'd take O(n) in a serial algorithm
        - How can we make this a prefix sum? First, we can list the nodes in order of an EULER TOUR, making this list 2*v - 1 long
            - Then, to do upward accumulation, we can split this list into "p" blocks among our processors
            - Then, put "x_i" at the first time each node appears in the tour and 0 at its later appearances; after a prefix sum, the difference between the values at a node's last and first occurrences (plus its own x_i) is exactly its subtree sum
        - To do downward accumulation, we can do something similar:
            - Put "x_i" for the first occurrence of a node, "-x_i" for the last occurrence of a node, and 0 otherwise
            - Then, a prefix sum over this list gives, at each node's first occurrence, the sum of the x's along the path from the root down to that node
            - However, a leaf only appears once in the tour, so its "+x_i" and "-x_i" would land on the same entry and cancel - so for leaf nodes, we'll just put "0" (and presumably add the leaf's own value back in at the end)
    - "The critical thing to take away here is that we could linearize our tree, taking something that wasn't linear and being able to do linear operations on it"
        - Constructing the Euler tour of our tree can be done efficiently in O(n/p), but we'll skip the details for now
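    - To make the downward version concrete, here's a tiny serial sketch on a 4-node tree (my own example, not lecture code; the serial prefix-sum loop is exactly what you'd replace with MPI_Scan on a distributed tour):

            /* Tree:   0        Euler tour: 0 1 3 1 0 2 0  (length 2*4 - 1 = 7)
             *        / \       Internal nodes contribute +x at their first
             *       1   2      appearance and -x at their last; leaves and
             *       |          middle appearances contribute 0.  The prefix
             *       3          sum, read at a node's first appearance, is the
             *                  sum of x over its ancestors (add the node's own
             *                  x for leaves).                                  */
            #include <stdio.h>

            int main(void) {
                int x[4]       = {10, 20, 30, 40};   /* per-node values               */
                int first[4]   = {0, 1, 5, 2};       /* first tour index of each node */
                int is_leaf[4] = {0, 0, 1, 1};

                /* signed tour array for the tour 0 1 3 1 0 2 0 */
                int a[7] = {+x[0], +x[1], 0, -x[1], 0, 0, -x[0]};

                int prefix[7];                       /* inclusive prefix sum          */
                prefix[0] = a[0];
                for (int i = 1; i < 7; i++) prefix[i] = prefix[i-1] + a[i];

                for (int v = 0; v < 4; v++) {
                    int down = prefix[first[v]] + (is_leaf[v] ? x[v] : 0);
                    printf("node %d: root-to-node sum = %d\n", v, down);
                }
                return 0;
            }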

- Finally, let's talk about a SUPER important problem in computational biology: DNA SEQUENCE ALIGNMENT!
    - Let's say you give me 2 DNA sequences over the alphabet "A,C,G,T", and we want to show how similar they are to one another by trying to ALIGN them
        - We'll put the 2 strings together and say that each match is a +1 score, a mismatch is a 0, but introducing a gap to help align them is BAD: that's a -1!
            - So, given the sequences ATGACC and AGAATC, the best alignment would be:

                    ATGA-CC
                    A-GAATC

                - Which gives us a score of "2"
    - However, how do we know where to put our gaps? How do we know when adding a gap will give us a higher score?
        - We could go with a dynamic programming solution VERY similar to edit distance, where we make an (m+1) x (n+1) table and say T[i,j] = best score between "a_0...a_i" and "b_0...b_j"
            - Then, we can define the following update:

                    T[i,j] = max(
                        T[i-1,j] - g,     //"g" is the penalty for adding a gap
                        T[i,j-1] - g,
                        T[i-1,j-1] + f(ai, bj)
                    );

            - In serial, this algorithm will take O(m*n) time to fill, with entry T[m,n] having the best possible score we could get
        - How can we make this a parallel prefix sums problem?
            - Essentially, when we're filling in a given row from left-to-right, we already know the above row (and therefore T[i-1, j] and T[i-1,j-1]) - however, we DON'T know T[i, j-1] (i.e. the element to the left of the current one)

                    (...known, previous row...)
                    ? | ? | ? | CURRENT T[i,j] | ? | ...

                - What we CAN do, then, is treat the row itself as a prefix operation: the only unknown in T[i,j] is the T[i,j-1] - g term, and if we add "j*g" to everything, that chain of "-g per step" turns into a plain running MAX - so we seed the leftmost element with T[i,0] = -i*g (i.e. assume it's all gaps) and compute the whole row as a prefix-MAX over quantities we already know from the previous row
        - So, the ultimate algorithm works out to this: for each row i...
            - Use a right shift so each processor receives its left neighbor's last element of the (i-1)th row - that's the T[i-1, j-1] it needs for its first column (O(t+u))
            - For each COLUMN j, create vector A in O(n/p) using the previous row's information:

                    A[j] = jg + max{T[i-1, j-1] + f(ai, bj), T[i-1, j] - g}

                - ...where f(ai, bj) is the score of the current letter match if we DON'T add a gap
            - Compute parallel prefix on A using MAX as an operator; store results in an array x[j] (O(n/p + (t+u) * log p))
            - Compute each T[i,j] in O(n/p) as:

                    T[i,j] = x[j] - jg
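        - As a concrete (simplified) sketch of one row update, here's the case where each rank owns a single column j = rank + 1 (my own code, not the lecture's; the real algorithm gives each rank a block of n/p columns, does a local prefix-max, and handles the row boundary I'm glossing over):

                /* One row of the alignment table via a prefix-MAX scan:
                 *   A[j]    = j*g + max(T[i-1][j-1] + f(a_i, b_j), T[i-1][j] - g)
                 *   x[j]    = max(A[1], ..., A[j])        (MPI_Scan with MPI_MAX)
                 *   T[i][j] = x[j] - j*g                                          */
                #include <mpi.h>
                #include <stdio.h>

                int main(int argc, char **argv) {
                    MPI_Init(&argc, &argv);
                    int rank;
                    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

                    int j = rank + 1;         /* this rank's column              */
                    int g = 1;                /* gap penalty                     */
                    /* dummy previous-row values and match score, just to have
                       something to run; in reality these come from row i-1     */
                    int prev_diag = 0;        /* T[i-1][j-1]                     */
                    int prev_up   = 0;        /* T[i-1][j]                       */
                    int f         = 1;        /* f(a_i, b_j)                     */

                    int A = j * g + (prev_diag + f > prev_up - g ? prev_diag + f
                                                                 : prev_up - g);

                    int x;                    /* running max over A[1..j]        */
                    MPI_Scan(&A, &x, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

                    int T_ij = x - j * g;
                    printf("column %d: T[i][j] = %d\n", j, T_ij);

                    MPI_Finalize();
                    return 0;
                }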

    - So, how much time will this algorithm take?
        - Overall, it'll take:

                Computation time = O(mn/p + m log p)
                Communication time = O((t+u) * m * log p)

            - ...and'll be maximally efficient when p = O(n / (log n))

- That's a LOT better than we had, but can we do even better?
    - In short, YES, we can! (again, slides; moving fast at the end there)

- So, in general, finding ways to turn your algorithm into calls to MPI routines that are highly optimized is GREAT, and you should strive to do that

- Alright, on Monday your first programming homework is due, and we'll finish up the MPI stuff then