//****************************************************************************//
//************* Parallel Matrix Operations - March 5th, 2020 ****************//
//**************************************************************************//
- Announcements for today:
    - HW2 supplement (what was the point?)
    - Quick review of efficient vs fast algorithms
    - Matrix operations in parallel
- "So, there was 1 big reason why I chose the problems I did for supplement 2: I think a BIG source of confusion when trying to parallelize an algorithm is that you should aim to use these pre-made parallelized operations, like reduce or scan or prefix sum, before you try to reinvent the wheel"
    - For question 2, this means finding an efficient representation of a function with a finite domain (which we can do with an O(k)-length lookup table, like an array), and we can then compose two such tables in O(k) time
        - Therefore, we could do a sequence of "n" compositions in O(k * log(n)) time with a reduction tree (a small sketch of the table representation is given below, after the matrix intro)!
    - Then, for question 3, we replace this not with a linear function (which we CAN parallelize efficiently, since composing linear functions just gives another linear function with 2 coefficients), but with the polynomial function "a*x^2 + b"
        - For "n" function compositions, this'll have a degree of 2^n and 2^(n-1)+1 coefficients (slightly less, but still on the order of 2^n)
        - Even if we have 2^n processors with all our polynomial coefficients pre-computed, we would still take at LEAST O(log(2^n)) = O(n) operations just to add these numbers together
        - So, in this case, function composition doesn't parallelize, because the squaring makes the size of the representation grow as the number of compositions grows
- "In tests and homeworks, I'm not asking you to give me compiling C code - but you should give me enough detail to understand what you're doing and conceivably write pseudocode for it, and figure out its runtime"
- "There's a very deep theorem about group theory that means a lot of our operations in the class end up going back to linear operations like linear algebra, but we won't go into the details of why"
--------------------------------------------------------------------------------
- Alright; we've looked at some embedding stuff, and now we want to apply those techniques to linear algebra (which is SUPER useful in the real world)
- Before we do that, though, we have to ask ourselves again: what's the difference between optimal efficiency and optimal speedup?
    - Remember, a higher-speedup algorithm will generally require more processors than a high-efficiency one
    - Also remember that broadcast/reduce/allreduce/scan (i.e. prefix-sums) are operations where the message size doesn't grow, and they are therefore slightly faster than gather/scatter/allgather
- Now, matrix algorithms in general have a regular structure with a natural partition for our computation, with 1D and 2D block, cyclic, or block-cyclic decompositions being the most common (although certainly not the only ones, since you can sometimes avoid idle time by using a different decomposition)
    - Typically, we can split the computation into input, intermediate, and output stages (I think?)
- Now, a matrix-vector multiplication has a serial complexity of O(n^2)
    - A 1D decomposition looks like giving each row to a different processor; a 2D decomposition looks like giving a "square" block of the matrix to each processor
    - In the 1D case, every processor always works with full rows of length "n", but no 2D block is longer than "n / sqrt(p)" - so the per-processor pieces keep shrinking as "p" grows
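- Going back to the supplement-2 discussion above: here's a minimal sketch (my own illustration in C, not the official solution - the function name "compose" and the toy tables are made up) of the lookup-table representation of a function on a finite domain, where a single composition costs O(k):

        /* compose.c - representing a function f: {0,...,k-1} -> {0,...,k-1}
           as a length-k lookup table, so that composing two functions is a
           single O(k) pass.  Composition is associative, so a reduction tree
           over "n" such tables runs in O(k * log(n)) parallel time. */
        #include <stdio.h>

        /* out[x] = g[f[x]], i.e. "apply f first, then g"; O(k) work per call */
        void compose(const int *g, const int *f, int *out, int k) {
            for (int x = 0; x < k; x++)
                out[x] = g[f[x]];
        }

        int main(void) {
            enum { K = 4 };
            int f[K] = {1, 2, 3, 0};     /* a cyclic shift on {0,1,2,3} */
            int g[K] = {0, 0, 1, 1};     /* some other arbitrary function */
            int h[K];
            compose(g, f, h, K);         /* h = g o f */
            for (int x = 0; x < K; x++)
                printf("h(%d) = %d\n", x, h[x]);
            return 0;
        }

    - The point is that compose() is an associative operator whose operands never grow past k entries, so it can be dropped into a standard parallel reduce over "n" tables (log(n) levels of O(k) work each); the question-3 polynomial breaks this pattern precisely because its "table" (the coefficient list) roughly doubles in size at every level instead of staying at k entries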
- Let's say each processor gets 1 row of the matrix and 1 element of the vector in a 1D partition; therefore, we can do the following algorithm:
    - Use an all-to-all broadcast/AllGather to distribute all of "x" to each processor in O(t*log(n) + u*n)
    - For each processor Pi, compute (in O(n)): y[i] = Sum_{j=0..n-1} A[i,j] * x[j]
    - So, computing a problem of size "n" with this algorithm on "n" processors takes ~O(n), meaning T(n,p) = O(n^2/p)
    - If we calculate the efficiency, this gets us an optimally efficient algorithm!
- In practice, this looks like giving n/p rows (and n/p elements of x) to each processor, doing an all-to-all broadcast, and then doing n/p local dot products on rows of length "n", matching our theoretical expectation with a runtime of T(n,p) = O(n^2/p + u*n) (an MPI-style sketch of this is given below)
    - Therefore, this'll be cost-optimal so long as p ~= O(n)
- That seems well and good - great! But we can only split this up into at most "n" rows; if we want to split these matrices up more, and therefore utilize more than "n" processors, we need to split them into 2D blocks!
- Let's assume we have O(n^2) processors, with each owning a single element of the matrix and (by convention) the last "n" processors each owning a single element of the vector "x" (I think?)
    - Then, for our algorithm:
        - Every element in a given column "j" of the matrix needs the same vector element x[j], so first send each vector element to the processor on the main diagonal of its column in O(t + u)
        - Have that processor do a one-to-all broadcast along its column (i.e. to "n" processors) in O((t + u)*log(n))
        - Multiply locally on each processor
        - Reduce along each row to get the final result from ALL processors in O((t + u)*log(n)) communication plus O(log(n)) additions
    - So, this gets us an overall computation time of O(log(n)) and a communication time of O(t*log(n) + u*log(n))
    - Is this optimally efficient? Nope, since we'd have a runtime of T(n,p) = O(n^2/p * log(n)) against a serial runtime of only O(n^2) - the total cost is O(n^2 * log(n)), so the efficiency is only O(1/log(n))
    - HOWEVER, Brent scaling assumes we can't do anything about that log term, when in actuality we can give each processor an n/sqrt(p) x n/sqrt(p) block, with the vector being distributed in n/sqrt(p)-sized chunks
        - Therefore, the message size is now n/sqrt(p) for all communications, changing our runtime to now be dominated by $n^2/p$ and giving us: T(n,p) = O(n^2/p + (t + u*n/sqrt(p)) * log(p))
        - ...which in turn gives an efficiency of: n^2/(n^2 + sqrt(p)*n*log(p))
        - ...which'll give us O(1) efficiency if p = O(n^2/log^2(n)), similar to homework 1 (a quick check of this is worked out below)
        - Continuing to increase "p" past this point may increase our speedup, but it will no longer be efficient
    - Overall, then, this gives us a computation time of O(n^2/p) and a communication time of O(t*log(sqrt(p)) + u*(n/sqrt(p))*log(sqrt(p)))
- This idea of lower-dimensional decompositions giving us efficient algorithms, but higher-dimensional ones giving us greater speedups at lower efficiency, is going to repeat consistently for these types of problems
- So, that's matrix-vector multiplication; what about matrix-matrix multiplication?
    - For this, we're going to have 2 input matrices and an (initially empty) output matrix "C"
    - A given row of C only ever needs the corresponding row of A to compute its elements, and a given column of B is needed by every element in the same column of C
    - Computing this in a basic loop looks like so:
        parallel-for i := 0 to n-1
            parallel-for j := 0 to n-1
                for k := 0 to n-1
                    C[i,j] += A[i,k] * B[k,j]
- Now, 1D partitioning by rows worked well for matrix-vector multiplication, so let's try it again!
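- Here's a minimal MPI-style sketch of the 1D row-partitioned matrix-vector multiply described above (my own illustration, not code from the slides; the name "matvec_1d", the all-ones test problem, and the assumption that n is divisible by p are all mine - MPI_Allgather stands in for the lecture's generic all-to-all broadcast):

        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Each rank owns n/p contiguous rows of A (row-major) and the matching
           n/p entries of x, and produces the matching n/p entries of y = A*x. */
        void matvec_1d(const double *A_rows, const double *x_local,
                       double *y_local, int n, MPI_Comm comm) {
            int p;
            MPI_Comm_size(comm, &p);
            int rows = n / p;                  /* rows (and x entries) per rank */

            /* All-gather the distributed vector x: ~O(t*log(p) + u*n) */
            double *x = malloc(n * sizeof(double));
            MPI_Allgather(x_local, rows, MPI_DOUBLE,
                          x, rows, MPI_DOUBLE, comm);

            /* n/p local dot products on rows of length n: O(n^2/p) work */
            for (int i = 0; i < rows; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++)
                    sum += A_rows[(size_t)i * n + j] * x[j];
                y_local[i] = sum;
            }
            free(x);
        }

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int p, rank;
            MPI_Comm_size(MPI_COMM_WORLD, &p);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            int n = 8 * p;                     /* keep n divisible by p */
            int rows = n / p;
            double *A_rows = malloc((size_t)rows * n * sizeof(double));
            double *x_local = malloc(rows * sizeof(double));
            double *y_local = malloc(rows * sizeof(double));
            for (int i = 0; i < rows; i++) {   /* all-ones test problem */
                x_local[i] = 1.0;
                for (int j = 0; j < n; j++)
                    A_rows[(size_t)i * n + j] = 1.0;
            }

            matvec_1d(A_rows, x_local, y_local, n, MPI_COMM_WORLD);
            if (rank == 0)
                printf("y[0] = %g (expect %d)\n", y_local[0], n);

            free(A_rows); free(x_local); free(y_local);
            MPI_Finalize();
            return 0;
        }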
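- As a quick sanity check of that "p = O(n^2/log^2(n))" claim for the 2D block decomposition (my own arithmetic, just plugging into the efficiency formula above):
    - With p = n^2/log^2(n), we get sqrt(p) = n/log(n) and log(p) = 2*log(n) - 2*log(log(n)) = O(log(n))
    - So the communication term is sqrt(p)*n*log(p) = (n/log(n)) * n * O(log(n)) = O(n^2)
    - And the efficiency n^2/(n^2 + sqrt(p)*n*log(p)) = n^2/O(n^2) = O(1); for any asymptotically larger "p", that communication term dominates and the efficiency drops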
- To implement this 1D row-partitioned matrix-matrix algorithm, we'll:
    - (see the steps on the slides; one possible realization is also sketched below)
    - This gives us a communication time of O(p * (t + u*n^2/p)) = O(t*p + u*n^2), and a computation time of O(2n^3/p)
    - The difficulty here is that n ~= p, which can require a LOT of processors in some cases
- Alright; that's 1 algorithm for matrix-matrix multiplication, but we'll go over some more next week, seeing how we can get some interesting lower bounds on memory usage. Read ahead on the slides and stay tuned!
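- One possible realization of that 1D matrix-matrix algorithm as an MPI-style sketch (my own guess at the structure, since the concrete steps are on the slides - the slides may organize the communication as p ring steps rather than one all-gather, but the total data volume moved is the same ~u*n^2; the name "matmul_1d", the row-distribution of B, and the divisibility assumption are all mine):

        #include <mpi.h>
        #include <stdlib.h>

        /* Each rank owns n/p rows of A and n/p rows of B (both row-major),
           and produces the matching n/p rows of C = A*B.  Assumes p divides n. */
        void matmul_1d(const double *A_rows, const double *B_rows,
                       double *C_rows, int n, MPI_Comm comm) {
            int p;
            MPI_Comm_size(comm, &p);
            int rows = n / p;

            /* Replicate all of B on every rank: ~u*n^2 words of communication */
            double *B = malloc((size_t)n * n * sizeof(double));
            MPI_Allgather(B_rows, rows * n, MPI_DOUBLE,
                          B, rows * n, MPI_DOUBLE, comm);

            /* Local (n/p x n) * (n x n) multiply: O(n^3/p) work */
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < n; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < n; k++)
                        sum += A_rows[(size_t)i * n + k] * B[(size_t)k * n + j];
                    C_rows[(size_t)i * n + j] = sum;
                }
            free(B);
        }

    - Note the trade-off this makes explicit: every rank ends up holding a full copy of B (O(n^2) memory per processor) in exchange for communicating only once, which is the kind of memory cost the lower-bound discussion is presumably going to poke at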