//****************************************************************************//
//*********** Optimal 2D Matrices / Caching - March 12th, 2020 **************//
//**************************************************************************//

- Today's glorious agenda:
    - Homework 3 (posted by tonight) - due Tuesday after break
    - Study guide posted tomorrow for Exam 2
        - Exam topics:
            - Bitonic sort
            - Sample sort
            - Embeddings
            - Matrix-matrix multiplications
            - MPI for the above
        - All of the above will be tested on Homework 3
            - "I'm planning on one homework problem on each of them"
        - The exam will happen on THURSDAY when you get back
    - In-lecture:
        - Finish matrix-matrix multiplication

- A whole 5 people braved the coronavirus to come to lecture...in a way, I'm
  impressed, but COME ON!
- Also, example solutions for programming assignment 2 were posted; one is a
  TA's solution from a past semester that DOESN'T interleave master/worker
  processing, and the other is Professor Isaac's solution that does try to use
  that time
    - The basic idea behind this is "what if we split the master thread into 2
      parts: a 'backend' and a 'frontend'?" The backend generates new partial
      solutions, and the frontend gives those solutions to the workers when
      requested
    - We only have 1 thread, though, so we basically have to implement a poor
      man's co-routine to handle this on a single thread

--------------------------------------------------------------------------------

- "So, I asked people to read ahead on the lower bounds of communication needed
  for matrix-matrix multiplication algorithms"
    - We'll also look at a useful idea we've been dancing around called
      "iteration space"
    - ...and then talk about how we can use ideas from parallel algorithms to
      also improve cache performance for serial algorithms

- Let's start by talking about iteration spaces - what are they?
    - For matrix multiplication, we're trying to get the result C = A x B
        - Where, for individual elements/submatrices:
              C(i,j) = sum from k=1...n { A(i,k) * B(k,j) }
    - So, a naive serial algorithm for matrix multiplication looks like this:

          for i:
              for j:
                  for k:
                      c[i,j] += a[i,k] * b[k,j]

    - We can represent this space as a 3D cube, with the axes i, j, and k:

              |\---------\
              | \    C    \
              |  \         \
            j | B k---------
              \   |         |
               \  |    A    |
                \ |         |
                 \|---------
                       i

        - On "A", we're iterating over the i/k plane; on "B", we're iterating
          over the j/k plane; and on "C", we're iterating through the i/j plane
        - Inside the cube, each integer coordinate (i,j,k) corresponds to one
          multiplication we have to do: a[i,k] * b[k,j], accumulated into c[i,j]
    - So, what we've been trying to do with our parallel decompositions is to
      take some portion of these operations and divvy them up among our
      separate processors, giving each processor a cube of operations or plane
      of operations or whatnot from this space to eventually use them to
      compute C
        - So, if we take a slice from A and a slice from B, and project those
          slices into the cube, their intersection forms a smaller
          cube/rectangular prism of operations that're needed, and projecting
          THAT onto C gives the resulting final values those calculations
          produce
    - So, assuming we've got those slices, we want to figure out how many
      operations are needed - and vice-versa: if we have some volume of
      computations, what inputs will we need from A and B, and what output
      will we get? (see the sketch below)
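- (A minimal sketch in C - mine, not from lecture; the matrix layout and the
  name "brick" are my own assumptions. It runs exactly the multiplications
  inside one "brick" of the iteration space, and its three shadows are exactly
  the submatrices it touches:)

      #include <stddef.h>

      /* Run every multiplication whose coordinate (i,j,k) lies in the brick
       * [i0,i1) x [j0,j1) x [k0,k1) of the iteration space. Its projections
       * onto the faces are exactly the data it needs:
       *   reads   A[i0..i1)[k0..k1)   (the i/k face)
       *   reads   B[k0..k1)[j0..j1)   (the j/k face)
       *   updates C[i0..i1)[j0..j1)   (the i/j face)
       * Matrices are row-major, n x n. */
      void brick(double *c, const double *a, const double *b, size_t n,
                 size_t i0, size_t i1, size_t j0, size_t j1,
                 size_t k0, size_t k1)
      {
          for (size_t i = i0; i < i1; i++)
              for (size_t j = j0; j < j1; j++)
                  for (size_t k = k0; k < k1; k++)
                      c[i*n + j] += a[i*n + k] * b[k*n + j];
      }

    - Calling brick(c, a, b, n, 0,n, 0,n, 0,n) is the whole naive algorithm;
      every decomposition we've seen just carves the cube into bricks like
      this and hands them out to processors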
- So, right now, the amount of memory moved by Cannon's algorithm (the best
  matrix multiplication algorithm we've seen so far) is n^2/p per processor
  per round
    - ...because to compute an (n/sqrt(p)) x (n/sqrt(p)) submatrix of C, we
      need n/sqrt(p) rows of A (each of length n) and n/sqrt(p) columns of B,
      and Cannon's streams those through in sqrt(p) rounds of n^2/p words each
- But is this the BEST we can do? Can we do any better? How much data movement
  do we NEED?
    - Well, think of iteration spaces! We have some subset of inputs "sA" from
      A and some subset of inputs "sB" from B, used to compute some subset
      "sC" of the resulting matrix C
    - So, in iteration space, the multiplications we can do with that data are
      exactly the intersection of these subsets' projections inside the cube
        - How big can that be? If each subset is a square of side n/sqrt(p) on
          its face, the largest intersection we could possibly get is a cube
          with sides of length n/sqrt(p)
        - In general (this is the Loomis-Whitney inequality), the number of
          multiplies has to be:
              <= sqrt(|sA| * |sB| * |sC|)
    - Let's say we split our algorithm into phases, each of which communicates
      at most M words to other processes (M = the local memory size); then,
      counting the M words already resident plus the M words received, each of
      the subsets has to be <= 2M
        - That means that, per phase, there can't be any more than
              sqrt((2M)^3) = 2*sqrt(2) * M^(3/2)
          multiplications
    - Now, if a processor does W multiplications total, then its number of
      phases L must satisfy:
          L >= W / (2*sqrt(2) * M^(3/2))
        - Since each full phase moves M words, the words it moves are at least:
              L*M >= W / (2*sqrt(2) * sqrt(M)) - M
          (the -M forgives the final, possibly partial phase)
    - If we're multiplying an m x k matrix by a k x n matrix on p processors,
      at least one of them HAS to perform m*k*n/p multiplications, so the
      number of words that processor moves is:
          >= m*k*n / (2*sqrt(2) * p * sqrt(M)) - M
    - Now, if we're multiplying two n x n matrices, that means a communication
      lower bound of:
          >= n^3 / (2*sqrt(2) * p * sqrt(M)) - M
        - If each processor only has memory for its three blocks, then
          M ~= 3n^2/p, so sqrt(M) ~= sqrt(3)*n/sqrt(p), i.e.
          1/sqrt(M) ~= sqrt(p)/(sqrt(3)*n), and the bound works out to
          Omega(n^2/sqrt(p)) words per processor
    - For Cannon's algorithm, each processor sends its whole n^2/p submatrix
      per round, over sqrt(p) rounds - that's O(n^2/sqrt(p)) words total,
      matching the lower bound, meaning that YES, Cannon's algorithm is
      optimal!
- So, Cannon's algorithm is optimal in bandwidth, but what about latency?
    - Assuming we only have enough memory to store 1 copy of each matrix
      spread across the processors, Cannon's is optimal; but if we have more
      memory, we can do better with "3D" algorithms, which replicate the
      inputs so that M (and hence the reuse per word moved) grows
- "So, that's the last algorithm we'll be looking at in the linear algebra
  part of the class - well, the dense linear algebra part, at least"

- That brings us to the last bullet point of the day: how can these parallel
  algorithm ideas help us write serial algorithms?
    - In the most basic model, where we just have memory and a CPU, then by
      Brent's lemma it doesn't help us much: serial algorithms will beat
      parallel ones
    - HOWEVER, almost all modern computers have a cache of recently-used
      values
        - While a computer architecture class would go super in-depth about
          where to put things in the cache and so forth, today we'll just
          assume we have a pretty smart cache that lets the user control where
          data goes AND what data gets evicted, so we only need to worry about
          what memory we access
        - In reality, a good default for real processors is to evict the least
          recently used piece of memory, although there are a BUNCH of
          different schemes
    - Let's say we have a cache of size Z, and we write a loop that looks like
      this:

          for i=1...n
              for j=1...n
                  for k=1...n
                      C[i,j] += A[i,k] * B[k,j]

    - How will the cache behave? (a concrete version of this loop is sketched
      below)
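- (Before answering - here's that loop as self-contained C, a sketch under my
  own naming and row-major layout assumptions, with comments on what each
  iteration drags through the cache:)

      #include <stddef.h>

      /* Naive i-j-k multiply of row-major n x n matrices. For each (i,j)
       * pair, the k loop streams one n-element row of A (contiguous) and
       * one n-element column of B (stride n) through the cache, while
       * c[i,j] itself stays put in a register. */
      void matmul_naive(double *c, const double *a, const double *b, size_t n)
      {
          for (size_t i = 0; i < n; i++)
              for (size_t j = 0; j < n; j++)
                  for (size_t k = 0; k < n; k++)
                      c[i*n + j] += a[i*n + k] * b[k*n + j];
      }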
- Using iteration spaces, we can actually guess: the inner loops are the ones
  that change the fastest, so for each (i,j) we sweep in a strip of A and a
  strip of B, then move along j to read in new strips of B...
    - If each strip is n elements, then let's say the cache can hold n/2
      elements - about half of each strip - so the start of a strip is evicted
      before we ever come back to it
    - So, currently we're getting NO cache reuse despite having ~2*n^3 memory
      movements to the cache (n^2 pairs of n-element strips over the whole for
      loop)
- How can we do better?
    - Theoretically, we can do the most multiplications with values in the
      cache when we have 1/3 of the cache devoted to each of A, B, and C; in
      this case (by the same Loomis-Whitney argument as before), the best we
      can do per cache-load is:
          sqrt((Z/3)^3) ~= Z^(3/2)
      multiplications
    - How many times would we have to refill the cache, then? There are n^3
      total operations, so if we can do Z^(3/2) operations per full cache,
      the total number of cache refills needed is:
          n^3 / Z^(3/2) ~= (n/sqrt(Z))^3
    - Similarly to how we had n/sqrt(p)-sided blocks in our parallel
      lower-bound proof, we can achieve this optimal cache usage by basically
      simulating Cannon's algorithm with p = (n/sqrt(Z))^2 = n^2/Z virtual
      processors, so each "processor's" blocks fit in the cache together
        - This essentially does a cube of calculations, then jumps to the next
          cube; the order of the cubes doesn't matter too much, but the switch
          to cubes is HUGE! (see the blocked-loop sketch at the end of these
          notes)
    - If we do this, and reuse the same set of memory for each batch of
      Z^(3/2) operations, then we've changed the memory traffic to:
          (n/sqrt(Z))^3 refills x ~Z words/refill ~= 2*n^3/sqrt(Z)
    - So, we've gone from not using the cache at all to cutting memory traffic
      by a factor of sqrt(Z) - proportional to the square root of the cache
      size - that's awesome!
- Okay, that's time, so I'll post Homework 3 and the study guide for your
  perusing pleasure later - have a nice break!
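- Addendum (my own sketch, not from lecture): the cube-at-a-time idea as
  blocked C code - the block side s is a tuning knob, with s ~ sqrt(Z/3) so
  that one block each of A, B, and C fit in a cache of Z elements together

      #include <stddef.h>

      /* Blocked (tiled) multiply of row-major n x n matrices: the iteration-
       * space cube is cut into s x s x s sub-cubes, and each sub-cube runs to
       * completion while its three s x s faces (blocks of A, B, and C) stay
       * resident in cache. Works for any n; edge blocks are just clipped. */
      void matmul_blocked(double *c, const double *a, const double *b,
                          size_t n, size_t s)
      {
          for (size_t ii = 0; ii < n; ii += s)
              for (size_t jj = 0; jj < n; jj += s)
                  for (size_t kk = 0; kk < n; kk += s)
                      /* one cube of the iteration space: */
                      for (size_t i = ii; i < ii + s && i < n; i++)
                          for (size_t j = jj; j < jj + s && j < n; j++)
                              for (size_t k = kk; k < kk + s && k < n; k++)
                                  c[i*n + j] += a[i*n + k] * b[k*n + j];
      }

    - Each sub-cube does ~s^3 multiplications while touching only ~3s^2 = Z
      words, which is exactly the reuse the refill argument above says is
      achievable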