//****************************************************************************//
//************** Homework 3/Exam 2 Review - March 31st, 2020 ****************//
//**************************************************************************//

- Okay, Professor Isaac is starting to teach online via iPad!
    - ...I'm getting some serious Khan Academy vibes

- Some starting announcements:
    - Homework 3 was due yesterday, and the solutions should've been posted right before this class
    - Programming Project 2 grades should be posted by this afternoon
    - Exam 2 is THIS THURSDAY - be prepared!
        - The exam will "go live" at Midnight (12am on April 2nd), and'll be due at 11:59pm; it'll be an online exam, and will be open note/open book
            - Basically, the only thing you're NOT allowed to do is collaborate with other peers; the exam should be entirely your own work
        - I'll definitely be available for questions/exam clarifications during the regular class time from 9:30-10:45, and should be able to do an afternoon "office hour" session ~2:30-3:30 as well
            - In other words, START THE EXAM EARLY!
            - I'll also do my best to answer emails that evening as well, but don't count on an immediate response
    - Solutions to the Exam 2 study questions

- For the exam problems in general, I don't expect anything to come out of left field; look at the approach I'm taking in the posted solutions and try to tackle problems in a similar way, and you should be off to a pretty good start
--------------------------------------------------------------------------------

- Alright, let's talk about the homework 3 questions!

- Question 1 was about sorting a list given 2 already-sorted subsequences, one of length "m" and another of length "n" (there's a small serial sketch of the idea at the end of this question)
    - The basic idea was that if we could reverse one of the sequences, we'd end up with a bitonic sequence, which we could then use bitonic merging on to get a final sorted list
        - Most people got the gist of this, but for full points you needed to be careful about HOW you were doing these steps - particularly the reversal!
        - While we can assume p divides "m + n", we DON'T know if the division between the 2 sequences happens at a processor boundary; it's possible there's a processor with elements from BOTH sequence 1 and sequence 2, and that's where the split is!
            - So, because of that, it's possible the last processor will need to send its elements to BOTH the 1st and 2nd processors w/ elements from sequence 2 to keep "(m+n)/p" elements on each processor (basically, we have to reverse the 2nd sequence and then "shift" its elements forward to avoid having too many elements on the boundary processors/too few elements on the last processor)
                - If we instead did a reversal where each processor ONLY communicated with 1 other processor, then the 1st processor that contains a PART of sequence 2 would receive "(m+n)/p" elements on top of what it already holds, leaving it with more than "(m+n)/p" elements; that makes the bitonic merge step trickier (we'd then need a special case to compare those extra elements, from the uneven list sizes, against a DIFFERENT processor during the compare-exchange step)
            - However, at MOST a given processor will need to communicate with 2 other processors to shift its elements correctly, which doesn't affect the asymptotic runtime of the reversal:
                - Comp: O((m+n)/p)
                - Comm: O(t + u(m+n)/p)
    - Then, for the bitonic merge itself, we have a logarithmic runtime:
        - Comp: O(log(p) * (m+n)/p)
        - Comm: O(log(p) * (t + u*(m+n)/p))
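    - A quick serial Python sketch of the reverse-then-bitonic-merge idea (my own sketch, not from the posted solutions - just the core logic, with made-up function names):

            # Reverse the 2nd sorted run so the concatenation is bitonic, then run a
            # bitonic merge over the whole thing.  Serial stand-in for the parallel steps;
            # the example keeps m + n a power of two so the plain merge network applies.

            def bitonic_merge(a, lo, n, ascending=True):
                """Sort the bitonic subsequence a[lo:lo+n] in place via compare-exchanges."""
                if n <= 1:
                    return
                half = n // 2
                for i in range(lo, lo + half):
                    if (a[i] > a[i + half]) == ascending:
                        a[i], a[i + half] = a[i + half], a[i]
                bitonic_merge(a, lo, half, ascending)
                bitonic_merge(a, lo + half, half, ascending)

            def merge_two_sorted_runs(seq1, seq2):
                """Ascending run + reversed ascending run = bitonic sequence; then merge it."""
                a = list(seq1) + list(reversed(seq2))
                bitonic_merge(a, 0, len(a))
                return a

            if __name__ == "__main__":
                print(merge_two_sorted_runs([1, 4, 9, 15], [2, 3, 8, 20]))
                # -> [1, 2, 3, 4, 8, 9, 15, 20]

        - In the parallel version, each processor would hold (m+n)/p consecutive entries of the list, the reversal would be the shifted exchange described above, and each compare-exchange pass would be one round of pairwise exchanges between processors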

- Question 2 was unfortunately worded more confusingly than I'd realized, but it was asking how to accomplish permutation routing using some existing parallel sort algorithm that's given to us
    - If you've looked at generic sorting libraries like C's "qsort", you have to provide the size of your elements along with a comparison function telling the algorithm how to compare them
        - So, the work will be proportional to the number of times compare gets called, and the memory movement will be proportional to the size of the elements you're comparing/swapping between processors
    - Extending this logic to parallel sorting, if we're given the computation time for sorting T_comp(n,p), that'll be proportional to the max number of comparisons any given processor performs
        - Given the communication time T_comm(n,p), then, there'll be some overhead + a time proportional to the size of the elements
    - For this problem, when we're using sorting to actually get each message to its destination, we'll have a new datatype containing the message AND the destination processor's rank, giving us a size of "m + O(1)" (message size + some constant number of bytes)
        - So, for the computation, we have T_comp(p,p)
        - Then, for the communication, we'd have "p" records but with each message being of size "m+1", which we can represent as T_comm((m+1)p, p) (there's a small sketch of this routing-by-sorting idea after this question)
    - Again, apologies for the wording, since I think this was actually a fairly straightforward problem that was worded somewhat confusingly
        - A brief tangent on network design (won't be tested): building fully-connected networks is SUPER expensive due to how many wires they require, so what's sometimes done instead is building a "crossbar" network that looks like a grid - but (I think?) this can require traversing up to "sqrt(n)" links to send a message
        - Alternatively, we can instead have a single big switch that looks like a sorting network in its processor connections, and if we build it correctly this can actually route any permutation without congestion at the cost of a few extra hops (up to log(p))!
            - Instead of using p^2 switches, then, this'd use p * log(p)^2 switches (double-check this?), which is better!
            - So, this algorithm DOES come up at the network-design level, even if you'd never write an application program like this
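    - A tiny serial Python sketch of that routing-by-sorting idea from question 2 (again my own sketch, not the assignment's interface; the record layout and names are made up):

            # Permutation routing via sorting: tag each payload with its destination rank
            # (a record of size m + O(1)), then sort the records by that rank.  After the
            # sort, the record sitting at index i is exactly the one bound for processor i.
            # Python's built-in sort stands in for the parallel sort we're handed.

            def route_by_sorting(payloads, destinations):
                """payloads[i] is held by processor i; destinations is a permutation of
                0..p-1 giving each payload's target processor."""
                records = list(zip(destinations, payloads))   # (dest rank, payload) pairs
                records.sort(key=lambda rec: rec[0])          # "parallel sort" stand-in
                return [payload for _, payload in records]    # entry i ends up on processor i

            if __name__ == "__main__":
                payloads = ["msg from 0", "msg from 1", "msg from 2", "msg from 3"]
                dests = [2, 0, 3, 1]                          # a permutation of the ranks
                print(route_by_sorting(payloads, dests))
                # -> ['msg from 1', 'msg from 3', 'msg from 0', 'msg from 2']

        - On a real machine, each processor would hold 1 of these (m+1)-sized records and the sort would be the given parallel sort, which is where the T_comp(p,p) and T_comm((m+1)p, p) costs above come from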

- Question 3, then, was about how to embed a ring in an array while having a dilation of at most 2
    - If we embed the ring directly in the array using the same IDs, then all but 1 of the connections are the same - but the "loop around" connection would go from traversing 1 link to traversing n-1 links, which is bad! That's a dilation of "n-1"!
    - So, what's the answer? Basically, it's taking a donut and squishing it flat!
        - Essentially, we need to interleave the 2 halves of our ring, like so:

              7---6---5---4
             /           /      => 0-7-1-6-2-5-3-4
            0---1---2---3

        - Mathematically, we can express this as:

                F(i) = 2i,          if i < p/2
                     = 2p-2i-1,     otherwise

    - This gets us an embedding with a dilation of 2, which is what the problem is asking for! (there's a quick check of this mapping in code at the end of this question)
        - This problem wasn't too tricky once you saw it; you mostly just had to spot the interleaving trick and write down the mapping
        - A slightly trickier example of dilation would be embedding a ring with N^2 nodes in an NxN mesh; if N is odd, then one embedding (not necessarily the best one) would be to just "snake" the ring like so:

                -------|
                |-------
                -------|

            - HOWEVER, this doesn't preserve our ring's 1st/last connection! In general, this'd give us a dilation of 2(N-1)
                - Remember, DILATION is the worst-case number of links we have to traverse for a single link in the original source graph
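    - A quick Python check of that interleaving map from question 3 (my own sanity check, nothing from the solutions):

            # Place ring node i at array slot F(i), then measure how far apart every pair
            # of ring neighbors lands in the array; the worst case is the dilation.

            def F(i, p):
                """Array position for ring node i: interleave the two halves of the ring."""
                return 2 * i if i < p // 2 else 2 * p - 2 * i - 1

            def dilation(p):
                """Worst-case array distance between any two ring neighbors under F."""
                return max(abs(F(i, p) - F((i + 1) % p, p)) for i in range(p))

            if __name__ == "__main__":
                p = 8
                print([F(i, p) for i in range(p)])   # -> [0, 2, 4, 6, 7, 5, 3, 1]
                print(dilation(p))                   # -> 2

        - Reading the array slots left to right gives back the 0-7-1-6-2-5-3-4 ordering from the picture above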

- Finally, we have question 4, which was about transposing an NxN matrix in parallel on an NxN torus mesh - BUT in a way that each processor held at most an O(1) number of matrix elements at any point
    - Here, if you just wrote an algorithm like "every entry is passed to the correct row, waits, then gets passed along to the correct column," there'd be a problem: all of the entries that turn the corner at a given processor would pile up there, so that processor becomes a bottleneck and needs ~N operations/communication rounds just to move each element 1 space over
    - However, remember that a transpose sends each element (i,j) to (j,i); since our NxN torus wraps around, we know the shortest distance between (i,j) and (j,i) is <= N
        - So, to avoid this pile up issue, we just DON'T WAIT for each element to "turn the corner" and go to the correct column
            - How do we guarantee that this DOESN'T cause a pile-up? The solution guide from previous classes was actually kinda vague about this
            - The reasoning, though, goes like this:
                - Each round, each process has at most 1 outgoing message in each of the N/S/E/W directions
                - Therefore, each process also has at most 1 incoming message in each of those directions each round
                - Therefore, if at the start of a round each process holds at most 5 items (1 it keeps at its final location, plus at most 1 to send in each direction), the same is still true once the round completes!
                - Therefore, the number of messages on any one process never exceeds 5 (there's a small simulation of this right below)
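        - As a sanity check of that argument, here's a small Python simulation (my own, not from the official solutions): each element walks its column to the correct row (hitting the diagonal), turns WITHOUT waiting, then walks its row to the correct column, 1 hop per round, always taking the shorter way around the torus; we just count how many elements sit on each node after every round

                # Occupancy check for the "don't wait to turn the corner" transpose on an
                # N x N torus.  Element (i, j) heads down/up column j to the diagonal node
                # (j, j), turns immediately, then heads along row j to its destination
                # (j, i).  Link contention isn't modeled; we only track per-node occupancy.

                def ring_path(src, dst, n):
                    """Indices visited going from src to dst the short way around a ring of
                    size n (src excluded, dst included)."""
                    fwd = (dst - src) % n
                    if fwd <= n - fwd:
                        return [(src + k) % n for k in range(1, fwd + 1)]
                    return [(src - k) % n for k in range(1, (n - fwd) + 1)]

                def max_occupancy(n):
                    """Route every element (i, j) to (j, i); return the most elements ever
                    resident on a single node at the end of any round."""
                    paths = []
                    for i in range(n):
                        for j in range(n):
                            path = [(i, j)]
                            path += [(r, j) for r in ring_path(i, j, n)]   # column phase -> (j, j)
                            path += [(j, c) for c in ring_path(j, i, n)]   # row phase    -> (j, i)
                            paths.append(path)
                    rounds = max(len(p) for p in paths)
                    worst = 0
                    for t in range(rounds):
                        counts = {}
                        for path in paths:
                            node = path[min(t, len(path) - 1)]   # arrived elements stay put
                            counts[node] = counts.get(node, 0) + 1
                            worst = max(worst, counts[node])
                    return worst

                if __name__ == "__main__":
                    for n in (3, 4, 5, 8, 9):
                        print(n, max_occupancy(n))   # stays at or below 5, matching the argument above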
    - That's a correct solution, and the one we gave from previous semesters; alternatively, there's a solution that's simpler to understand and a little less efficient, BUT it makes use of embeddings
        - One way we can understand this problem is that each element needs to move along an "L" to get to its final destination in the transpose:

                | | |
               -  | |
               ---  |
               ------
                        (...)

        - If we can move these elements along these "L"s without them building up, then we'll have done this transpose successfully without causing bottlenecks!
            - Can we do this? Yes! We just need to reverse each "half" of the "L," and then move the elements past each other (there's a small check of this idea in code just below)!
            - So, we can do the reversal in T(n/2) steps, which for the whole array will be n/2 + n/4 + ... = n communication rounds, and then we can slide the elements past each other in "n" steps
            - Since we're just sliding 2 lines past each other and over the main diagonal, this'll never build up more than 3 elements on a given processor!
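        - To convince yourself of the "L" picture, here's one more small Python check (my own, not from the solutions): each "L" - row i from the diagonal out to the right edge, plus column i from the diagonal down to the bottom edge - maps onto ITSELF under the transpose, and within the L the transpose is just a reversal of the L read as a path

                # Check that transposing = reversing each "L"-shaped hook of the matrix.
                # hook(i) is read as a path: in along row i from the right edge to the
                # diagonal, then down column i to the bottom edge.  Reversing that path
                # sends (i, k) exactly to (k, i), so reversing every hook transposes A.

                def hook(i, n):
                    """Cells of the i-th hook, listed as a path."""
                    return [(i, k) for k in range(n - 1, i, -1)] + [(k, i) for k in range(i, n)]

                def transpose_via_hooks(matrix):
                    n = len(matrix)
                    out = [[None] * n for _ in range(n)]
                    for i in range(n):
                        path = hook(i, n)
                        values = [matrix[r][c] for (r, c) in path]
                        for (r, c), v in zip(path, reversed(values)):   # reversal of the hook
                            out[r][c] = v
                    return out

                if __name__ == "__main__":
                    n = 4
                    a = [[10 * r + c for c in range(n)] for r in range(n)]
                    assert transpose_via_hooks(a) == [[a[c][r] for c in range(n)] for r in range(n)]
                    print(transpose_via_hooks(a))

            - Reversing the 2 arms separately and then sliding them past each other (over the diagonal) is exactly one way to carry out that reversal, which is the scheme described above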