//****************************************************************************//
//*************** Prefix-Sum Algorithm - January 30th, 2020 *****************//
//**************************************************************************//
- "Alright, you're here early (i.e. on time) - so let's reward you by talking about some logistic stuff!"
- To run on the cluster, you should be running it as follows:
qsub -q coc-ice-multi -l nodes=4:core8,walltime=00:30:00 -I
mpirun -np 16 ./prog1.out
- When we start using multiple nodes, we need to specify a "hostfile" that says how much work we want each node to do
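- For example, with OpenMPI a hostfile is just a plain-text list of node names and how many processes ("slots") each should run - the hostnames below are placeholders, not real PACE node names:
      # myhosts (OpenMPI-style hostfile; hostnames are placeholders)
      node001 slots=8
      node002 slots=8
- You'd then point mpirun at it with something like: mpirun -np 16 --hostfile myhosts ./prog1.out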
- It seems when you run this, it's actually only allocating 4 cores, which means there are more MPI processes than there are cores - and that can lead to HUGE performance overheads, especially with communication protocols
- "Logically, you think this'd give us 4*8 = 32 cores, but it appears that's not what PACE is doing"
- However, because PACE is a shared cluster among multiple classes, asking for more resources than are currently free may mean it just hands you whatever subset of nodes happens to be available
- To fix this, we can instead ask for a single node with 16 processors on it:
qsub -q coc-ice-multi -l nodes=1:ppn=16,walltime=00:30:00 -I
- "Going forward, I'd use this syntax; the number of nodes and the number of processors per nodes"
- To actually send files from your computer to the cluster, you can use scp before you "ssh" into the cluster
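- For example (the username, hostname, and target directory here are placeholders for your own account):
      scp prog1.c <gt-username>@<pace-login-node>:~/hw1/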
- Also, note that we're generating squared distances uniformly from (0, r^2) and then taking the square root (e.g. x = sqrt(dist)*cos(th)) for statistical reasons; sampling the radius uniformly instead would give a DIFFERENT distribution, with points piled up near the center (the area of a ring grows with its radius, so larger radii need to be more likely)
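- As a quick sanity check (a sketch, not the lecture's code), here's how that sampling looks in C - dist is drawn uniformly from (0, r^2), and only then do we take the square root:
      #include <math.h>
      #include <stdlib.h>

      /* Sample a point uniformly over the AREA of a disk of radius r.
         Drawing the squared distance uniformly and then taking sqrt keeps
         the points uniform; drawing the radius uniformly instead would
         over-represent the center. */
      void random_point_in_disk(double r, double *x, double *y)
      {
          const double PI = 3.14159265358979323846;
          double dist = ((double)rand() / RAND_MAX) * r * r;   /* uniform in (0, r^2) */
          double th   = ((double)rand() / RAND_MAX) * 2.0 * PI;
          *x = sqrt(dist) * cos(th);
          *y = sqrt(dist) * sin(th);
      }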
--------------------------------------------------------------------------------
- Alright, let's get back to prefix sum!
- As we started learning last time, this is where, given "x_0 ... x_n", we want to compute every partial sum "x_0 + x_1 + ... + x_i" (e.g. for [3, 1, 4, 1], the prefix sums are [3, 4, 8, 9])
- MPI_Scan can do this in parallel for us, and not just for sums, but for ANY associative operation
- If we can fit things into this prefix sum algorithm, we can save ourselves a lot of code and communication over the network - it might look specific and technical, but it's actually a VERY general technique
- A lot of iterative operations can be expressed in terms of associative things, which "scan" can then compute for us
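- For reference, the call itself is tiny - here's a minimal (not-from-lecture) sketch where every rank contributes one int and gets back the inclusive prefix sum over ranks 0..i:
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          int local  = rank + 1;   /* each rank's contribution */
          int prefix = 0;
          /* inclusive scan: rank i receives local_0 + local_1 + ... + local_i */
          MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
          printf("rank %d: prefix sum = %d\n", rank, prefix);

          MPI_Finalize();
          return 0;
      }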
- Last time, we talked briefly about HOW we can do this in parallel
- Basically, each processor keeps a running total of everything it has seen so far and exchanges it with partners in a hypercubic permutation; whatever we receive always gets folded into the running total we pass along in later rounds, and if it came from a LOWER rank we also add it into our own prefix value
- We can also implement this using shift communicators, which may be more intuitive but is slightly slower and can be awkward on some networks
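- Here's a rough sketch of that hypercubic exchange (assuming the number of ranks is a power of two - the names and structure are mine, not the lecture's):
      #include <mpi.h>

      /* "total" is the sum of everything this rank has seen so far and is
         what gets passed along each round; "result" only absorbs values
         that arrived from lower-ranked partners, so it ends up as the
         inclusive prefix sum for this rank. */
      int hypercube_prefix_sum(int local, MPI_Comm comm)
      {
          int rank, p;
          MPI_Comm_rank(comm, &rank);
          MPI_Comm_size(comm, &p);          /* assumed to be a power of two */

          int result = local;
          int total  = local;
          for (int d = 1; d < p; d <<= 1) {
              int partner = rank ^ d;       /* hypercubic permutation */
              int recv_val;
              MPI_Sendrecv(&total, 1, MPI_INT, partner, 0,
                           &recv_val, 1, MPI_INT, partner, 0,
                           comm, MPI_STATUS_IGNORE);
              if (partner < rank)
                  result += recv_val;       /* came from a lower rank */
              total += recv_val;            /* always passed on later  */
          }
          return result;
      }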
- Where would we use this? Let's look at some examples!
- Let's say we're doing "polynomial evaluation," where we have the function:
P(x) = a_0 + a_1*x + a_2*x^2 + (...)
- Let's assume the coefficients "a_0 ... a_n" are distributed in even-sized blocks over our "p" processors
- It seems like each processor would have to compute its starting power of x (x^(n/p), x^(2n/p), etc.) all by itself, which takes up to O(n) multiplications for the later blocks
- INSTEAD, we'll compute these powers with a distributed array where "1" is the first element and "x" is in all the rest of the elements, with "multiplication" as our operation
- We can then run MPI_Scan on this array, giving all the needed powers of "x" to each processor
- Then, we can compute the sub-sums for each processor (e.g. "a_0 + a_1*x + ... a_(n/p)*x^(n/p)")
- Finally, we can add all these partial sums together with reduction
- Computation-wise, this means T = O(n/p + log(p)), with the communication time being O((t+u) * log(p))
- Efficiency-wise, this means we stay efficient as long as n grows at least as fast as p*log(p), i.e. p = O(n / log n)
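- Putting that together, here's one way the polynomial evaluation could look (a sketch under my own naming - it uses the exclusive scan MPI_Exscan so each rank gets the power of x where its block STARTS):
      #include <mpi.h>
      #include <math.h>

      /* a[] holds this rank's block of "block" consecutive coefficients
         (rank 0 owns a_0..a_(n/p - 1), rank 1 the next block, and so on). */
      double parallel_poly_eval(const double *a, int block, double x, MPI_Comm comm)
      {
          int rank;
          MPI_Comm_rank(comm, &rank);

          /* multiplicative scan over x^block gives x^(rank*block) */
          double my_block_power = pow(x, block);
          double offset = 1.0;
          MPI_Exscan(&my_block_power, &offset, 1, MPI_DOUBLE, MPI_PROD, comm);
          if (rank == 0) offset = 1.0;      /* Exscan leaves rank 0 undefined */

          /* local partial sum: a_k * x^(rank*block + k) over our block */
          double power = offset, local = 0.0;
          for (int k = 0; k < block; k++) {
              local += a[k] * power;
              power *= x;
          }

          /* finally, add all the partial sums together with a reduction */
          double result = 0.0;
          MPI_Reduce(&local, &result, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
          return result;                    /* meaningful on rank 0 */
      }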
- Let's look at an example that EVERYONE'S seen before: the Fibonacci sequence!
- So, f_0 = f_1 = 1, and f_i = f_(i-1) + f_(i-2)
- Or, in linear algebra land, our update is:
[f_i  f_(i-1)] = [f_(i-1)  f_(i-2)] * [1 1]
                                      [1 0]
- We can turn this into a prefix-sum problem by noting that we have some overlapping series - each new element is dependent on the two that came before it!
- If we look at the linear algebra version, we're accumulating these small fixed-size matrices each time we multiply - so we can do prefix-sum on 2x2 Matrix multiplication!
S_i = S_1 * M^(i-1), where S_i = [f_i  f_(i-1)] and M is the 2x2 matrix above
- This is NOT something that MPI has built-in, but luckily, we can extend it with our own operation (via MPI_Op_create) to give each processor M^2, M^3, etc. (sketched below)
- We can generalize this to 3x3 matrices for 3-term recurrences, 4x4 for 4-term recurrences, etc. (within reason, since we're still doing matrix multiplications)
- We can ALSO do this with modular arithmetic to get parallel pseudo-random numbers, which is great!
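- Concretely, the extension is a user-defined MPI_Op; here's a sketch (my code, not the lecture's) that scans over 2x2 matrix multiplication - note commute = 0, since matrix multiplication is associative but NOT commutative:
      #include <mpi.h>
      #include <stdio.h>

      /* Combine two 2x2 matrices (stored row-major as 4 doubles):
         inout = in * inout, where "in" comes from the lower-ranked side. */
      static void matmul2x2(void *in, void *inout, int *len, MPI_Datatype *dt)
      {
          double *a = (double *)in, *b = (double *)inout;
          for (int i = 0; i < *len; i++, a += 4, b += 4) {
              double c[4] = { a[0]*b[0] + a[1]*b[2],  a[0]*b[1] + a[1]*b[3],
                              a[2]*b[0] + a[3]*b[2],  a[2]*b[1] + a[3]*b[3] };
              for (int k = 0; k < 4; k++) b[k] = c[k];
          }
      }

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          MPI_Datatype mat_t;                      /* one 2x2 matrix */
          MPI_Type_contiguous(4, MPI_DOUBLE, &mat_t);
          MPI_Type_commit(&mat_t);

          MPI_Op matmul_op;
          MPI_Op_create(matmul2x2, 0, &matmul_op); /* 0 = not commutative */

          double M[4] = {1, 1, 1, 0};              /* Fibonacci step matrix */
          double S[4];
          MPI_Scan(M, S, 1, mat_t, matmul_op, MPI_COMM_WORLD);
          /* rank i now holds M^(i+1); [f_1 f_0] * M^k = [f_(k+1) f_k] */
          printf("rank %d: [%g %g; %g %g]\n", rank, S[0], S[1], S[2], S[3]);

          MPI_Op_free(&matmul_op);
          MPI_Type_free(&mat_t);
          MPI_Finalize();
          return 0;
      }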
- So, all of these examples are naturally "lines of things," where things are dependent one after the other - but we can also use prefix sums to do less obvious stuff!
- For instance, TREE ACCUMULATION, where we're trying to compute the sums of all the subtrees below a given node (or alternatively, going above ("we need directions from our boss's boss's boss or something"))
- Naively traversing the tree, this'd take O(n) in a serial algorithm
- How can we make this a prefix sum? First, we can list the nodes in order of an EULER TOUR, making this list 2*v - 1 long
- Then, to do upward accumulation, we can split this list into "p" blocks among our processors
- Then, the first time (??)
- To do downward accumulation, we can do something similar:
- Put "x_i" for the first occurrence of a node, "-x_i" for the last occurrence of a node, and 0 otherwise
- Then a prefix sum over this list, read at a node's first occurrence, gives the sum along the path from the root down to that node: its ancestors have been "opened" but not yet "closed," while subtrees we've already left cancel themselves out
- However, this doesn't include leaves in our sum! So, for leaf nodes, we'll put "0"
- "The critical thing to take away here is that we could linearize our tree, taking something that wasn't linear and being able to do linear operations on it"
- Constructing the Euler tour of our tree can be done efficiently in O(n/p), but we'll skip the details for now
- Finally, let's talk about a SUPER important problem in computational biology: DNA SEQUENCE ALIGNMENT!
- Let's say you give me 2 DNA sequences over the alphabet "A,C,G,T", and we want to show how similar they are to one another by trying to ALIGN them
- We'll put the 2 strings together and say that each match is a +1 score, a mismatch is a 0, but introducing a gap to help align them is BAD: that's a -1!
- So, given the sequences ATGACC and AGAATC, the best alignment would be:
ATGA-CC
A-GAATC
- Which gives us a score of "2"
- However, how do we know where to put our gaps? How do we know when adding a gap will give us a higher score?
- We could go with a dynamic programming solution VERY similar to edit distance: we make an (m+1) x (n+1) table T where T[i,j] = the best score for aligning "a0...ai" with "b0...bj"
- Then, we can define the following update:
T[i,j] = max(
T[i-1,j] - g, //"g" is the penalty for adding a gap
T[i,j-1] - g,
T[i-1,j-1] + f(ai, bj)
);
- In serial, this algorithm will take O(m*n) time to fill, with entry T[m,n] having the best possible score we could get
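- As a reference point, here's a minimal serial sketch of that table fill (assuming the lecture's scoring: match +1, mismatch 0, gap -1 - the helper names are mine):
      #include <stdio.h>
      #include <string.h>

      #define G 1   /* gap penalty: a gap costs -G */

      static int max3(int a, int b, int c)
      {
          int m = a > b ? a : b;
          return m > c ? m : c;
      }

      /* Fill the (m+1) x (n+1) table and return T[m][n], the best score. */
      int align_score(const char *a, const char *b)
      {
          int m = strlen(a), n = strlen(b);
          int T[m + 1][n + 1];                          /* C99 VLA, fine for a sketch */
          for (int i = 0; i <= m; i++) T[i][0] = -i * G;
          for (int j = 0; j <= n; j++) T[0][j] = -j * G;
          for (int i = 1; i <= m; i++)
              for (int j = 1; j <= n; j++) {
                  int f = (a[i - 1] == b[j - 1]) ? 1 : 0;   /* f(ai, bj) */
                  T[i][j] = max3(T[i - 1][j] - G,           /* gap in b       */
                                 T[i][j - 1] - G,           /* gap in a       */
                                 T[i - 1][j - 1] + f);      /* (mis)match     */
              }
          return T[m][n];
      }

      int main(void)
      {
          printf("best score: %d\n", align_score("ATGACC", "AGAATC"));
          return 0;
      }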
- How can we make this a parallel prefix sums problem?
- Essentially, when we're filling in a given row from left-to-right, we already know the above row (and therefore T[i-1, j] and T[i-1,j-1]) - however, we DON'T know T[i, j-1] (i.e. the element to the left of the current one)
row i-1:  (...all known...)
row i:    ? | ? | ? | CURRENT T[i,j] | ? | ...
- What we CAN do, then, is fold the gap penalties into a running MAX: start the row at T[i,0] = -i*g (i.e. assume it's all gaps so far), build A[j] = j*g + max(T[i-1,j-1] + f(ai,bj), T[i-1,j] - g) purely from the already-known previous row, and then T[i,j] = max(A[0..j]) - j*g (with A[0] just being T[i,0]) - so computing the whole row becomes a prefix operation with MAX as the operator!
- So, the ultimate algorithm works out to this, repeated for each of the m rows (a code sketch follows these steps):
- Use right shift to send last element of (i-1)th row on each processor (O(t+u))
- For each COLUMN j, create vector A in O(n/p) using the previous row's information:
A[j] = jg + max{T[i-1, j-1] + f(ai, bj), T[i-1, j] - g}
- ...where f(ai, bj) is the score of the current letter match if we DON'T add a gap
- Compute parallel prefix on A using MAX as an operator; store results in an array x[j] (O(n/p + (t+u) * log p))
- Compute each T[i,j] in O(n/p) as:
T[i,j] = x[j] - jg
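- To make the scan step concrete, here's a sketch of one row update in the simplified setting where each rank owns exactly ONE column j (with n/p columns per rank you'd add a local pass before and after the scan) - the function and argument names are mine:
      #include <mpi.h>

      /* prev_diag = T[i-1][j-1], prev_up = T[i-1][j], match = f(ai, bj),
         g = gap penalty. Column 0 just seeds the scan with T[i][0] = -i*g. */
      int row_update_one_column(int i, int j, int prev_diag, int prev_up,
                                int match, int g, MPI_Comm comm)
      {
          int A;
          if (j == 0) {
              A = -i * g;                             /* left boundary: all gaps */
          } else {
              int from_diag = prev_diag + match;      /* extend a (mis)match     */
              int from_up   = prev_up - g;            /* come from above + a gap */
              A = j * g + (from_diag > from_up ? from_diag : from_up);
          }

          int x;
          /* running maximum over columns 0..j = prefix scan with MAX */
          MPI_Scan(&A, &x, 1, MPI_INT, MPI_MAX, comm);
          return x - j * g;                           /* this is T[i][j] */
      }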
- So, how much time will this algorithm take?
- Overall, it'll take:
Computation time = O(mn/p + m log p)
Communication time = O((t+u) * m * log p)
- ...and'll be maximally efficient when p = O(n / (log n))
- That's a LOT better than we had, but can we do even better?
- In short, YES, we can! (again, slides; moving fast at the end there)
- So, in general, finding ways to turn your algorithm into calls to MPI routines that are highly optimized is GREAT, and you should strive to do that
- Alright, on Monday your first programming homework is due, and we'll finish up the MPI stuff then