//****************************************************************************//
//*********** Optimal 2D Matrices / Caching - March 12th, 2020 **************//
//****************************************************************************//
- Today's glorious agenda:
- Homework 3 (posted by tonight)
- due Tuesday after break
- Study guide posted tomorrow for Exam 2
- Exam topics:
- Bitonic sort
- Sample sort
- Embeddings
- Matrix-matrix multiplications
- MPI for above
- All of the above will be tested on homework 3 - "I'm planning on one homework problem on each of them"
- The exam will happen on THURSDAY when you get back
- In-lecture:
- Finish matrix-matrix multiplication
- A whole 5 people braved the coronavirus to come to lecture...in a way, I'm impressed, but COME ON!
- Also, example solutions for programming assignment 2 were posted; one is a TA's solution from a past semester that DOESN'T interleave master/worker processing, and the other is Professor Isaac's solution that does try to use that time
- The basic idea behind this is "what if we split the master thread into 2 parts: a 'backend' and a 'frontend'?" The backend generates new partial solutions, and the frontend gives those solutions to the workers when requested
- We only have 1 thread, though, so we basically have to implement a poor man's co-routine to handle this on a single thread
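- A minimal sketch of that single-threaded coroutine idea in C with MPI (the `work_t` payload and queue helpers are hypothetical names for illustration, not the posted solution):
    #include <mpi.h>
    #include <stdbool.h>

    #define TAG_REQUEST 1
    #define TAG_WORK    2

    typedef struct { int data[4]; } work_t;  // hypothetical partial-solution payload
    bool   have_buffered_work(void);         // hypothetical queue helpers
    work_t pop_work(void);
    bool   generate_next_partial(void);      // "backend": returns false when exhausted

    void master_loop(void) {
        bool more = true;
        while (more || have_buffered_work()) {
            int flag;
            MPI_Status status;
            // "Frontend": check whether any worker has asked for work
            MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &flag, &status);
            if (flag && have_buffered_work()) {
                int dummy;
                MPI_Recv(&dummy, 1, MPI_INT, status.MPI_SOURCE, TAG_REQUEST,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                work_t w = pop_work();
                MPI_Send(&w, (int)sizeof w, MPI_BYTE, status.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            } else {
                // "Backend": no request pending, so generate one more partial solution
                more = more && generate_next_partial();
            }
        }
    }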
--------------------------------------------------------------------------------
- "So, I asked people to read ahead on the lower-bounds of communications needed for matrix-matrix multiplication algorithms"
- We'll also look at a useful idea we've been dancing around called "iteration space"
- ...and then talk about how we can use ideas from parallel algorithms to also improve cache performance for serial algorithms
- Let's start by talking about iteration spaces - what are they?
- So, for matrix multiplication, we're trying to get the result
C = A x B
- Where, for individual elements/submatrices:
C(i,j) = sum over k=1..n of A(i,k) * B(k,j)
- So, a naive serial algorithm for matrix multiplication looks like this:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];
- We can represent this space as a 3D cube, with the axes i,j, and k
|\---------\
| \ C \
| \ \
j B k---------
\ | |
\ | A |
\| |
---------i
- On "A", we're iterating over the i/k plane, on "B" we're iterating over the j/k plane, and over "C" we're iterating through the i/j plane
- Inside the cube, each integer coordinate (i,j,k) corresponds to one multiplication we have to do: A[i,k] * B[k,j], which contributes to C[i,j]
- So, what we've been trying to do with our parallel decompositions is to divvy these operations up among our separate processors, giving each processor a cube of operations, or a plane of operations, or whatnot from this space to eventually use in computing "C"
- If we take a slice from "A" and a slice from "B" and project those slices into the cube, their intersection forms a smaller rectangular prism of operations that are needed, and projecting THAT onto "C" gives the resulting final values those calculations contribute to
- So, given those slices, we want to figure out how many operations are needed, and vice-versa: given some volume of computations, what inputs do we need from A and B, and what output do we get? (See the sketch below)
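- As a concrete illustration (my own sketch, not from the slides), here's one "brick" of the iteration space in C; reading off the index ranges gives exactly the three projections:
    // Do the multiply-adds in the brick [i0,i1) x [j0,j1) x [k0,k1).
    // The indices touched are exactly the three projections of the brick:
    // A[i0:i1][k0:k1], B[k0:k1][j0:j1], and C[i0:i1][j0:j1].
    void brick(int i0, int i1, int j0, int j1, int k0, int k1,
               int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int i = i0; i < i1; i++)
            for (int j = j0; j < j1; j++)
                for (int k = k0; k < k1; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }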
- So, right now, the amount of memory moved per processor by Cannon's algorithm (the best matrix multiplication algorithm we've seen so far) is ~n^2/sqrt(p)
- ...because to compute its "n/sqrt(p) x n/sqrt(p)" submatrix of C, each processor needs n/sqrt(p) full rows of A and n/sqrt(p) full columns of B - about 2*n^2/sqrt(p) words of input in total
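- Sanity check on that number (my arithmetic, not the slides'): Cannon's runs sqrt(p) shift rounds, each moving one n/sqrt(p) x n/sqrt(p) block of A and one of B:
    \text{words per processor} = \sqrt{p} \cdot 2\,\frac{n^2}{p} = \frac{2n^2}{\sqrt{p}} = \Theta\!\left(\frac{n^2}{\sqrt{p}}\right)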
- But is this the BEST we can do? Can we do any better? How many multiplications do we NEED?
- Well, think of iteration spaces! We grab some subset of inputs from A ("sA") and some subset of inputs from B ("sB") to compute some subset of the resulting matrix C ("sC")
- In iteration space, the multiplications we can perform with just those pieces are exactly the points whose three projections land inside sA, sB, and sC - the intersection of those splotches extruded into the cube
- How big can that intersection possibly be? The volume is maximized when the region is a cube: if each face has area |s|, the sides have length sqrt(|s|) and the volume is |s|^(3/2) (this is the Loomis-Whitney inequality)
- So, the number of multiplies has to be:
<= sqrt(|sA| * |sB| * |sC|)
- Let's say we split our algorithm into phases, each of which communicates at most "M" words (the size of local memory); then, during one phase, a processor sees at most M resident words plus M received words of each matrix, so each subset satisfies |sA|, |sB|, |sC| <= 2M
- That means that, per phase, there can't be any more than:
2*sqrt(2) * M^(3/2) multiplications
- Now, if a processor has to perform "W" multiplications in total, then the number of phases "L" must satisfy:
    L >= W / (2*sqrt(2)*M^(3/2))
- Therefore, the number of words it communicates is >= (L-1)*M >= W/(2*sqrt(2)*sqrt(M)) - M
- Thus, if there are "p" processors (multiplying an m x k matrix by a k x n matrix), at least one of them HAS to perform >= m*k*n/p multiplications, so that processor moves at least this many words:
    m*k*n / (2*sqrt(2)*p*sqrt(M)) - M
- Now, if we're multiplying two NxN matrices, that means a communication lower bound per processor of:
    >= n^3 / (2*sqrt(2)*p*sqrt(M)) - M
- With just enough memory for one block of each of A, B, and C, M ~= 3*n^2/p, so sqrt(M) ~= sqrt(3)*n/sqrt(p)
- Plugging in 1/sqrt(M) ~= sqrt(p)/(sqrt(3)*n), the bound works out to Omega(n^2/sqrt(p)) words moved per processor
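- The whole chain in one place (my reconstruction of the argument; same quantities as above, nothing new):
    \text{mults per phase} \le \sqrt{|s_A|\,|s_B|\,|s_C|} \le \sqrt{(2M)^3} = 2\sqrt{2}\,M^{3/2}
    L \ge \frac{W}{2\sqrt{2}\,M^{3/2}}, \qquad W = \frac{n^3}{p} \text{ for the busiest processor}
    \text{words moved} \ge (L-1)\,M \ge \frac{n^3}{2\sqrt{2}\,p\,\sqrt{M}} - M = \Omega\!\left(\frac{n^2}{\sqrt{p}}\right) \text{ when } M = \Theta(n^2/p)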
- And Cannon's algorithm moves Theta(n^2/sqrt(p)) words per processor (sqrt(p) rounds, one n/sqrt(p) x n/sqrt(p) block of A and of B each round) - it matches the lower bound, meaning that YES, Cannon's algorithm is optimal!
- So, Cannon's algorithm is optimal in bandwidth, but what about latency?
- Assuming we only have enough memory to store one block of each matrix per processor, Cannon's is optimal; but if we have more memory, we can use "3D" algorithms that replicate the matrices across layers of processors to communicate less
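- You can see the pay-off just by plugging a c-times-larger memory, M = 3*c*n^2/p, into the bound we derived above:
    \text{words} \gtrsim \frac{n^3}{2\sqrt{2}\,p\,\sqrt{3c\,n^2/p}} = \Theta\!\left(\frac{n^2}{\sqrt{c\,p}}\right)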
- "So, that's the last algorithm we'll be looking at in the linear algebra part of the class - well, the dense linear algebra part at least"
- That brings us to the last bullet point of the day: how can these parallel algorithm ideas help us write serial algorithms?
- In the most basic model, where we just have memory and a CPU, they don't help us much: by Brent's lemma, a good serial algorithm will match or beat a simulated parallel one
- HOWEVER, almost all modern computers have a cache of recently-used values. While a computer architecture class would go super in-depth about where to put things in the cache and so forth, today we'll just assume we have a pretty smart cache that lets the user control where the data goes AND what data gets evicted - so all we need to worry about is what memory we access
- In reality, a good default for real processors is to evict the least recently used piece of memory, although there's a BUNCH of different schemes
- Let's say we have a cache of size Z, and we write a loop that looks like this:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];
- How will the cache behave? Using iteration spaces, we can actually guess: the innermost loop changes the fastest, so for each (i,j) we sweep in a strip of A (a row) and a strip of B (a column), then move along j to read in new strips of B...
- Each strip is n elements; let's say the cache can only hold about n/2 elements - half of a strip
- So, currently we get NO cache reuse: essentially every read misses, which is ~2*n^3 words moved into the cache (2*n^2 strips of n elements each) over the whole loop nest
- How can we do better?
- Theoretically, we can do the most multiplications with values in the cache when we devote 1/3 of the cache each to A, B, and C; in that case, the most work we can do per cache-full is:
    (sqrt(Z/3))^3 = (Z/3)^(3/2) ~= Z^(3/2) multiplications
- How many times would we have to refill the cache, then? There are n^3 total operations, so if we can do ~Z^(3/2) operations per cache-full, the total number of refills needed is:
    n^3 / Z^(3/2) = (n/sqrt(Z))^3
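- To get a feel for the numbers (a hypothetical cache, not one from the lecture): say the cache holds Z = 12288 doubles (96 KiB), so each third holds 4096 doubles and the tiles are 64 x 64:
    b = \sqrt{Z/3} = 64, \qquad b^3 = 262{,}144 \text{ multiplies per cache-full vs. } 3b^2 = 12{,}288 \text{ words loaded}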
- Similarly to how sqrt(p) appeared in our parallel lower-bound proof, we can achieve this optimal cache usage by basically simulating Cannon's algorithm with p ~= 3*n^2/Z "virtual processors", i.e. blocks that each fit in a third of the cache
- This essentially does a cube of calculations, then jumps to the next cube; the order of the cubes doesn't matter too much, but the switch to cubes is HUGE!
- If we do this, reusing the same cached data for each batch of Z^(3/2) operations, we've cut the memory traffic to:
    ~2*n^3/sqrt(Z)
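- A minimal sketch of that blocked loop in C (my illustration, not the lecture's code; the block size b would be tuned to ~sqrt(Z/3) in practice):
    // Tiled matrix multiply: process one b x b x b "cube" of the iteration
    // space at a time, so the three b x b tiles it touches fit in cache.
    void matmul_tiled(int n, int b, double A[n][n], double B[n][n], double C[n][n]) {
        for (int ii = 0; ii < n; ii += b)
            for (int jj = 0; jj < n; jj += b)
                for (int kk = 0; kk < n; kk += b)
                    // One cube of work: each cached tile element is reused ~b times
                    for (int i = ii; i < ii + b && i < n; i++)
                        for (int j = jj; j < jj + b && j < n; j++)
                            for (int k = kk; k < kk + b && k < n; k++)
                                C[i][j] += A[i][k] * B[k][j];
    }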
- So, we've gone from not using the cache at all (~2*n^3 words moved) to a reduction proportional to sqrt(Z), the square root of the cache size - that's awesome!
- Okay, that's time, so I'll post Homework 3 and the study guide for your perusing pleasure later - have a nice break!