//****************************************************************************//
//*************** Prefix-Sum Algorithm - January 30th, 2020 *****************//
//**************************************************************************//

- "Alright, you're here early (i.e. on time) - so let's reward you by talking about some logistics stuff!"
- To run on the cluster, you should be running it as follows:

    qsub -q coc-ice-multi -l nodes=4:core8,walltime=00:30:00 -I
    mpirun -np 16 ./prog1.out

- When we start using multiple nodes, we need to specify a "hostfile" that says how much work we want each node to do
- It seems that when you run this, it's actually only allocating 4 cores, which means there are more threads than there are processors - and that can lead to HUGE performance overheads, especially with communication protocols
- "Logically, you'd think this'd give us 4*8 = 32 cores, but it appears that's not what PACE is doing"
- However, because PACE is a shared cluster among multiple classes, asking for more nodes than are currently free may mean it just gives you whatever nodes ARE currently available
- To fix this, we can instead ask for a single node with 16 processors on it:

    qsub -q coc-ice-multi -l nodes=1:ppn=16,walltime=00:30:00 -I

- "Going forward, I'd use this syntax: the number of nodes and the number of processors per node"
- To actually send files from your computer to the cluster, you can use scp before you "ssh" into the cluster
- Also, note that we're generating distances from (0, r^2) and then taking the square root of them (e.g. x = sqrt(dist)*cos(th)) for statistical reasons; not taking the square root would change the distribution (don't they cancel out? No - the area of a thin ring grows with its radius, so a uniformly-chosen radius would crowd points near the center; drawing dist uniformly from (0, r^2) and then taking sqrt spreads the points out in proportion to area)

--------------------------------------------------------------------------------

- Alright, let's get back to prefix sum!
- As we started learning last time, this is where we're trying to compute every prefix sum "x_0 + ... + x_i" of the sequence "x_0 ... x_n"
- MPI_Scan can do this in parallel for us, and not just for sums, but for ANY associative operation
- If we can fit things into this prefix-sum mold, we can save ourselves a lot of code and communication over the network - it might look specific and technical, but it's actually a VERY general technique
- A lot of iterative operations can be expressed in terms of associative operations, which "scan" can then compute for us
- Last time, we talked briefly about HOW we can do this in parallel
- Basically, each processor keeps the sum of everything it has seen so far and exchanges partial sums in a hypercubic permutation; if we receive a number from a lower rank, we add it into our own value, but if it comes from a higher rank, we only add it into a "separate" sum to be passed on later
- We can also implement this using shift communicators, which may be more intuitive but is slightly slower and can be awkward on some networks
- Where would we use this? Let's look at some examples!
- Let's say we're doing "polynomial evaluation," where we have the function:

    P(x) = a_0 + a_1*x + a_2*x^2 + ...

- Let's assume the coefficients "a_0 ... a_n" are distributed in even-sized blocks over our "p" processors
- It seems like each processor would have to compute "x, x^2, etc." by itself, which can get REALLY big for large powers like x^n
- INSTEAD, we'll compute these powers by scanning an array whose first element is "1" and whose remaining elements are all "x", with multiplication as our operation
- We can then run MPI_Scan on this array, giving all the needed powers of "x" to each processor
- Then, we can compute the sub-sums on each processor (e.g. "a_0 + a_1*x + ... + a_(n/p)*x^(n/p)")
- Finally, we can add all these partial sums together with a reduction
- Computation-wise, this means T = O(n/p + log p), with the communication time being O((t+u) * log p)
- Efficiency-wise, this means we stay efficient as long as p = O(n / log n)
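- (A hypothetical sketch of this, written up afterwards rather than copied from lecture: each rank below holds a made-up block of coefficients, uses MPI_Exscan - the exclusive cousin of MPI_Scan - with MPI_PROD to learn the power of x its block starts at, and then a reduction adds the partial sums together; the block size and dummy coefficients are assumptions for illustration)

    /* Hypothetical sketch (not code from lecture): evaluate
     *     P(x) = a_0 + a_1*x + ... + a_(n-1)*x^(n-1)
     * with the coefficients block-distributed over the ranks.  A prefix
     * product (exclusive scan with MPI_PROD) gives every rank the power
     * of x its block starts at; block size and coefficients are made up. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int    b = 4;        /* coefficients per rank, so n = b*p        */
        const double x = 0.5;      /* the point we're evaluating at            */
        double a[4];               /* this rank's block of coefficients        */
        for (int k = 0; k < b; k++) a[k] = 1.0;   /* dummy data: all-ones poly */

        /* Each block contributes a factor of x^b; an exclusive scan of these
         * factors with MPI_PROD gives rank r the product over ranks 0..r-1,
         * i.e. x^(r*b), the power its first coefficient gets multiplied by. */
        double block_pow = 1.0;
        for (int k = 0; k < b; k++) block_pow *= x;

        double start_pow = 1.0;
        MPI_Exscan(&block_pow, &start_pow, 1, MPI_DOUBLE, MPI_PROD, MPI_COMM_WORLD);
        if (rank == 0) start_pow = 1.0;   /* Exscan leaves rank 0 undefined    */

        /* Local sub-sum: a_(r*b+k) * x^(r*b+k) for k = 0..b-1.                */
        double partial = 0.0, pw = start_pow;
        for (int k = 0; k < b; k++) { partial += a[k] * pw; pw *= x; }

        /* Add all the partial sums together with a reduction.                 */
        double result = 0.0;
        MPI_Reduce(&partial, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("P(%g) ~= %g\n", x, result);

        MPI_Finalize();
        return 0;
    }

- (MPI_Exscan leaves rank 0's output buffer undefined, which is why the sketch overwrites it with 1 = x^0 by hand; an alternative is a plain MPI_Scan followed by dividing out the local block's factor)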
- Let's look at an example that EVERYONE'S seen before: the Fibonacci sequence!
- So, f_0 = f_1 = 1, and f_i = f_(i-1) + f_(i-2)
- Or, in linear algebra land, our update is:

    [f_i  f_(i-1)] = [f_(i-1)  f_(i-2)] * | 1  1 |
                                          | 1  0 |

- We can turn this into a prefix-sum-style problem by noting that this is an overlapping recurrence - each new element is dependent on the two that came before it!
- If we look at the linear algebra version, we're accumulating these small fixed-size matrices each time we multiply - so we can do a prefix "sum" where the operation is 2x2 matrix multiplication!

    S_i = S_1 * M^(i-1)

- This is NOT something that MPI has built in, but luckily, we can extend it with our own function to give each processor M^2, M^3, etc. (sketch below)
- We can generalize this to 3x3 matrices for 3-term recurrences, 4x4 for 4-term recurrences, etc. (within reason, since we're still doing matrix multiplications)
- We can ALSO do this with modular arithmetic to get parallel pseudo-random numbers, which is great!
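- (A possible sketch of that user-defined operation - again mine, not from lecture: pack a 2x2 matrix into one MPI datatype, register a NON-commutative matrix-multiply op with MPI_Op_create, and MPI_Scan then hands rank r the matrix M^(r+1))

    /* Hypothetical sketch (not from lecture): prefix products of the 2x2
     * Fibonacci matrix M = [1 1; 1 0] using MPI_Scan with a user-defined
     * (non-commutative) operation.  After the scan, rank r holds M^(r+1). */
    #include <mpi.h>
    #include <stdio.h>

    /* inoutvec = invec * inoutvec : invec holds the lower-rank operand, so
     * it goes on the LEFT to keep the products in ascending rank order.   */
    static void matmul2x2(void *invec, void *inoutvec, int *len, MPI_Datatype *dt)
    {
        long long *a = (long long *)invec;      /* earlier (lower-rank) matrix */
        long long *b = (long long *)inoutvec;   /* later matrix, overwritten   */
        for (int m = 0; m < *len; m++, a += 4, b += 4) {
            long long c[4];
            c[0] = a[0]*b[0] + a[1]*b[2];
            c[1] = a[0]*b[1] + a[1]*b[3];
            c[2] = a[2]*b[0] + a[3]*b[2];
            c[3] = a[2]*b[1] + a[3]*b[3];
            for (int k = 0; k < 4; k++) b[k] = c[k];
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Treat one 2x2 matrix (4 long longs, row-major) as a single element. */
        MPI_Datatype mat2x2;
        MPI_Type_contiguous(4, MPI_LONG_LONG, &mat2x2);
        MPI_Type_commit(&mat2x2);

        MPI_Op matprod;
        MPI_Op_create(matmul2x2, 0 /* NOT commutative */, &matprod);

        long long M[4] = {1, 1, 1, 0};   /* every rank contributes one copy of M */
        long long Mpow[4];
        MPI_Scan(M, Mpow, 1, mat2x2, matprod, MPI_COMM_WORLD);

        /* Top-right entry of M^k is F_k (with F_1 = F_2 = 1), so rank r now
         * knows F_(r+1) without anyone computing the whole chain serially.  */
        printf("rank %d: F_%d = %lld\n", rank, rank + 1, Mpow[1]);

        MPI_Op_free(&matprod);
        MPI_Type_free(&mat2x2);
        MPI_Finalize();
        return 0;
    }

- (Marking the op as non-commutative is what tells MPI to keep the products in ascending rank order, which matters because matrix multiplication doesn't commute)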
- So, all of these examples are naturally "lines of things," where elements depend on one another in sequence - but we can also use prefix sums to do less obvious stuff!
- For instance, TREE ACCUMULATION, where we're trying to compute the sums of all the subtrees below a given node (or, alternatively, accumulating from the nodes above ("we need directions from our boss's boss's boss or something"))
- Naively traversing the tree, this'd take O(n) in a serial algorithm
- How can we make this a prefix sum? First, we can list the nodes in the order of an EULER TOUR, making this list 2*v - 1 long
- Then, to do upward accumulation, we can split this list into "p" blocks among our processors
- Then, the first time (??)
- To do downward accumulation, we can do something similar:
    - Put "x_i" for the first occurrence of a node, "-x_i" for the last occurrence of a node, and 0 otherwise
    - (?)
- However, this doesn't include the leaves in our sum! So, for leaf nodes, we'll put "0"
- "The critical thing to take away here is that we could linearize our tree, taking something that wasn't linear and being able to do linear operations on it"
- Constructing the Euler tour of our tree can be done efficiently in O(n/p), but we'll skip the details for now

- Finally, let's talk about a SUPER important problem in computational biology: DNA SEQUENCE ALIGNMENT!
- Let's say you give me 2 DNA sequences over the alphabet "A,C,G,T", and we want to show how similar they are to one another by trying to ALIGN them
- We'll put the 2 strings next to each other and say that each match is a +1 score, a mismatch is a 0, but introducing a gap to help align them is BAD: that's a -1!
- So, given the sequences ATGACC and AGAATC, the best alignment would be:

    ATGA-CC
    A-GAATC

- Which gives us a score of "2" (hm - wouldn't just lining them up with no gaps at all score 3 under this scheme?)
- However, how do we know where to put our gaps? How do we know when adding a gap will give us a higher score?
- We could go with a dynamic programming solution VERY similar to edit distance: we make an (m+1) x (n+1) table and say T[i,j] = best score between the first i characters of "a" and the first j characters of "b"
- Then, we can define the following update:

    T[i,j] = max( T[i-1,j]   - g,               // "g" is the penalty for adding a gap
                  T[i,j-1]   - g,
                  T[i-1,j-1] + f(a_i, b_j) );

- In serial, this algorithm will take O(m*n) time to fill, with entry T[m,n] holding the best possible score we could get
- How can we make this a parallel prefix-sums problem?
- Essentially, when we're filling in a given row from left to right, we already know the whole row above (and therefore T[i-1,j] and T[i-1,j-1]) - however, we DON'T know T[i,j-1] (i.e. the element to the left of the current one):

    row i-1 (known):  known | known |  known   | known | ...
    row i:              ?   |   ?   |  T[i,j]  |   ?   | ...

- What we CAN do, then, to compute these left-hand elements, is to start at the leftmost element, initialize it to "-i*g" (i.e. assume it's all gaps), and then treat all the elements to its right as a prefix operation where we take the MAX of the previous result (which covers T[i,j-1]) and the best we can get from the known previous row (T[i-1,j-1] and T[i-1,j])!
- So, the ultimate algorithm works out to this, for each row (I believe?):
    - Use a right shift to send the last element of the (i-1)th row on each processor to the next processor (O(t+u))
    - For each COLUMN j, create a vector A in O(n/p) using only the previous row's information:

        A[j] = j*g + max{ T[i-1,j-1] + f(a_i, b_j), T[i-1,j] - g }

      ...where f(a_i, b_j) is the score of matching the current letters if we DON'T add a gap
    - Compute a parallel prefix on A using MAX as the operator; store the results in an array x[j] (O(n/p + (t+u)*log p))
    - Compute each T[i,j] in O(n/p) as:

        T[i,j] = x[j] - j*g

    - (there's a small serial sketch of this row update at the very end of these notes)
- So, how much time will this algorithm take?
- Overall, it'll take:

    Computation time   = O(mn/p + m*log p)
    Communication time = O((t+u) * m * log p)

- ...and it'll be maximally efficient when p = O(n / log n)
- That's a LOT better than what we had, but can we do even better?
- In short, YES, we can! (again, see the slides; we were moving fast at the end there)
- So, in general, finding ways to turn your algorithm into calls to MPI routines that are highly optimized is GREAT, and you should strive to do that
- Alright, on Monday your first programming homework is due, and we'll finish up the MPI stuff then
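- (As promised above, a small serial sketch of that row update via prefix max - my own reconstruction to sanity-check the algebra, not code from lecture; the running max in the second inner loop is exactly the piece a parallel version would compute with MPI_Scan/MPI_Exscan and MPI_MAX over the block-distributed row)

    /* Hypothetical serial sketch (my own, not from lecture): checks the
     * prefix-max reformulation of one row of the alignment table
     *     A[j]    = j*g + max( T[i-1][j-1] + f(a_i, b_j), T[i-1][j] - g )
     *     x[j]    = max( x[j-1], A[j] )      <- the prefix-max / scan step
     *     T[i][j] = x[j] - j*g
     * against the usual three-way recurrence, using the lecture's scoring
     * (match +1, mismatch 0, gap -1) and example strings.                  */
    #include <stdio.h>
    #include <string.h>

    #define MAXN 16

    static int max2(int a, int b) { return a > b ? a : b; }
    static int f(char a, char b)  { return a == b ? 1 : 0; }   /* match score */

    int main(void)
    {
        const char *seqA = "ATGACC", *seqB = "AGAATC";
        const int m = (int)strlen(seqA), n = (int)strlen(seqB), g = 1;
        int T[MAXN][MAXN], A[MAXN];

        /* Row/column 0: aligning against nothing means all gaps. */
        for (int i = 0; i <= m; i++) T[i][0] = -i * g;
        for (int j = 0; j <= n; j++) T[0][j] = -j * g;

        for (int i = 1; i <= m; i++) {
            /* A[j] uses ONLY the (already known) previous row. */
            for (int j = 1; j <= n; j++)
                A[j] = j * g + max2(T[i-1][j-1] + f(seqA[i-1], seqB[j-1]),
                                    T[i-1][j] - g);

            /* Prefix max over A: this running max is what a parallel version
             * would hand to MPI_Scan/MPI_Exscan with MPI_MAX.               */
            int x = -i * g;                     /* x[0] = T[i][0] + 0*g */
            for (int j = 1; j <= n; j++) {
                x = max2(x, A[j]);
                T[i][j] = x - j * g;
            }
        }

        /* For these two strings this prints 3 (the no-gap alignment). */
        printf("best alignment score = %d\n", T[m][n]);
        return 0;
    }

- (Note how A[j] only ever reads row i-1, so the whole A vector can be built independently in O(n/p) before the single prefix-max pass - that's what lets the left-to-right dependence collapse into one scan)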