//****************************************************************************//
//******************* Sample Sort - February 20th, 2020 *********************//
//****************************************************************************//

- ...a rainy, empty day, corresponding to a semi-rainy, still-very-empty class
    - (Seriously, it's 1 minute to class start and there are only 9 people in the room!)

- But, here're some notes hot off the whiteboard!
    - Exam 1 grading will be done ASAP (likely early next week)
        - Homework 2 should've already been graded; we were late, so apologies for that
        - For the exam, we'll look at each question individually and curve based on those individual scores
    - Programming assignment 2 will be released next week on Tuesday, and'll be due 2 weeks later
        - After that, Homework 3 will come out (~March 8th or so?)
    - ALSO: due to some obligation stuff, there'll be office hours from 1-2 tomorrow

- Today's outline:
    - Review sorting thus far
    - Sample sort

- "Now, if you'll excuse me briefly..." *Professor Isaac mysteriously leaves the room, then returns*
--------------------------------------------------------------------------------

- So, if we know we're sorting in parallel, there are already some things we can say about the algorithm
    - Since serial comparison sorts have a lower bound of Ω(n log n) comparisons, our "no superlinear speedups" rule gives the following lower bounds on parallel sorts:

            Computation: Ω(n log(n) / p)
            Communication: Ω(u * n/p)

        - Keep in mind that sorting cost can be highly dependent on the initial order of the elements, so this communication bound is a worst-case scenario (where every element may need to move to another process)
    - We then looked at Bitonic Sort, which had the following runtimes:

            Comp: O((n/p)*log(n/p) + (n/p)*log^2(p))
            Comm: O(log^2(p) * (t + u*n/p))
                (where, as usual, "t" is the per-message latency and "u" is the per-word transfer cost)

        - So, there are log^2(p) rounds of data movement - can we do better, since this is over the theoretical minimum? Is there a way we can determine where each element should go in advance?
            - The "log^2" term comes from the structure of the sort: there are log(p) merge stages, and each stage needs up to log(p) compare-exchange rounds between processes, giving ~log^2(p) rounds total (sketched below)
    - Next, we looked at parallel quicksort (there's a code sketch after this review), where we:
        - Pick a pivot somehow and broadcast it in O(log(p) * (t + u))
        - Partition locally in O(n/p)
        - Re-arrange the partitions based on the pivot in O(t + un/p)
        - Split the communicators and recurse
        - Repeat until p=1, then locally sort
            - So, in the expected case of a near-perfect split, we'd have O(log p) iterations before our local sort, and we'd get:

                    Comp: O(n/p*log(p) + n/p*log(n/p))
                    Comm: O(t log(p) + un/p * log(p))

                - So, communication-wise, this is more efficient, since there are fewer rounds of communication (log(p) of them, instead of bitonic sort's log^2(p))!
                    - It is still quicksort, though, so the worst case (repeatedly bad splits) is MUCH worse
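        - (Here's a rough mpi4py sketch of that loop - bcast/sendrecv/Split are real mpi4py calls, but the pairing scheme and pivot choice are just one simple option, and it assumes p is a power of two and every rank still holds data at every level:)

                from mpi4py import MPI
                import random

                def parallel_quicksort(local, comm):
                    p, rank = comm.Get_size(), comm.Get_rank()
                    if p == 1:
                        return sorted(local)  # p = 1: just sort locally
                    # Pick a pivot "somehow" (rank 0's random element) and broadcast it
                    pivot = comm.bcast(random.choice(local) if rank == 0 else None, root=0)
                    low = [x for x in local if x <= pivot]
                    high = [x for x in local if x > pivot]
                    # Re-arrange: rank i pairs with rank i + p/2, and the two swap the
                    # halves that belong on the other side of the pivot
                    half = p // 2
                    if rank < half:
                        local = low + comm.sendrecv(high, dest=rank + half, source=rank + half)
                    else:
                        local = high + comm.sendrecv(low, dest=rank - half, source=rank - half)
                    # Split the communicator in two and recurse
                    sub = comm.Split(color=0 if rank < half else 1, key=rank)
                    result = parallel_quicksort(local, sub)
                    sub.Free()
                    return result

                if __name__ == "__main__":  # e.g. mpiexec -n 4 python pqsort.py
                    comm = MPI.COMM_WORLD
                    mine = [random.randrange(100) for _ in range(8)]
                    print(comm.Get_rank(), parallel_quicksort(mine, comm))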

- As we talked about WAY back in the last lecture, though, choosing a random pivot doesn't work as well in parallel: instead of an expected 50/50 split, we have to plan for the larger side holding ~3/4 of the array
    - This is because each round's cost is limited by whichever side of the split is LARGER, and the expected size of the larger side of a random split is 3/4 of the array
        - This gets us a new recurrence:

                T(n,p) = O(n/p) + T(3n/4, p/2)
                    = (n/p) * sum{i=0 to log(p)-1, (3/2)^i} + (3/2)^log(p) * (n/p) * log((3/2)^log(p) * (n/p))

            - (i.e. the cost of all the splitting rounds, followed by an O(m log m) local sort on the m = (3/2)^log(p) * (n/p) elements each process is left with)
            - Note also that:

                    (3/2)^log(p) = (2^log(3/2))^log(p) = p^log(3/2), where log(3/2) ~= 0.585

    - So, instead of each process's chunk shrinking by a full factor of p (as with perfect pivots), it only shrinks by a factor of p^log(4/3) ~= p^0.415 - i.e. we're left with (n/p) * p^0.585 elements each
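        - (Quick numeric sanity check of that identity, plugging in a made-up p and n:)

                import math

                p, n = 1024, 10**6
                print((3/2) ** math.log2(p))      # ~57.7
                print(p ** math.log2(3/2))        # ~57.7 - same number, exponent ~0.585
                print(n * (3/4) ** math.log2(p))  # ~56,300 elements left per process...
                print(n / p ** math.log2(4/3))    # ...which is n / p^0.415, not n / p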

- Alright, this sub-optimal performance is coming from unbalanced arrays, so how can we choose a better pivot? By using the TRUE MEDIAN algorithm
    - ...this assumes the work stays perfectly load-balanced across the remaining processes after each recursion, which isn't automatic, but we'll ignore that for now
    - If we can do this, we end up with a NEW sorting runtime of:

            T(n,p) = T_medianFinder(n,p) + O(n/p) + T(n/2, p/2)

    - Great! Can we do this?
        - We want to find the median of a list; obviously, we could do this by sorting, but that takes an ugly O(n log n)
        - We could then use the serial "median of medians" algorithm from CS 3510: partition the array into 5-element chunks, find the median of each of THOSE in constant time, recursively find the median of the n/5 chunk-medians, then partition the list into elements smaller than, equal to, and greater than that median-of-medians and recurse into the side that must contain the true median
            - As we saw in our pre-test, we can prove this algorithm completes in O(n)
            - That's better, but it's not quite fast enough - can we throw any parallel algorithmic magic at it?
                - Unfortunately, parallelizing median-of-medians directly leads to load imbalancing from the recursive step, since each process might end up with a different-sized subset of the array as we recurse (e.g. the extreme case of every processor but 1 having a single element, and one processor holding all the other n-p elements)
            - How can we solve this? We don't - we ignore it!
                - Instead, each process finds the local median of its own sub-array, we assign each local median a weight equal to the number of elements it was computed over, and we choose the weighted median of those medians as the pivot (see the sketch just below)
                - This gets us an array size of at MOST (3/4)n when we recurse; we'll skip the proof for now
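                - (A tiny sequential simulation of that weighted-median trick - the names and data are made up, and plain lists stand in for the processes:)

                        def weighted_median_pivot(subarrays):
                            # (local median, weight) for each non-empty "process"
                            pairs = sorted((sorted(a)[len(a) // 2], len(a)) for a in subarrays if a)
                            total = sum(w for _, w in pairs)
                            running = 0
                            for median, weight in pairs:
                                running += weight
                                if 2 * running >= total:  # first median covering half the weight
                                    return median

                        # even very unbalanced "processes" give a reasonable pivot:
                        print(weighted_median_pivot([[1, 2, 3, 4, 5, 6, 7], [100], [8, 9]]))  # -> 4
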
        - So, using this parallel median-of-medians-style algorithm, we end up having log(p) communication rounds, giving us costs of:

                Comp: O(n/p * log(n))
                Comm: O(t*log(p) + u(p*log(p) + n/p))

    - So, if we use this algorithm to pick our pivot every round, we'll have a computation time that looks like:

            Comp: O(n/p * (log(n) + log(p)))
            Comm: O(log(p) * ((t + un/p) + t*log(p) + u*(p*log(p) + n/p)))
                ~= O(un/p * log(p)), if we ignore "t" and assume n >> p^2 * log(p)

        - So, looking back at our theoretical lower bounds, this is computationally the best case possible so long as p < n, and we only have a single log term for communication - not quite ideal, but better than for bitonic sort!
            - However, this still has log(p) rounds of communication, just like bitonic sort

- So, we want to try and reduce the number of communication rounds - and that brings us to SAMPLE SORT!
    - The idea here is that we split our arrays into MULTIPLE (p) groups at once, rather than 2 at a time, to cut down the rounds of communication
    - A naive first attempt might look like this:
        - Sort the array locally, in O(n/p * log(n/p))
        - Split the local sequence evenly into "p" segments of size n/p^2
        - All-to-all communication
        - Locally merge your "p" sequences
    - So, as long as p < n, we've got a runtime of:

            Comp: O(n*log(n) / p)
            Comm: O(t*p + un/p) (for the all-to-all)

        - Why doesn't this work? While each of the individual chunks will be sorted in itself, the chunks each process receives aren't separated by VALUE, so the concatenation isn't globally sorted! (e.g. with P0 = [1,2] and P1 = [3,4]: P0 receives [1] and [3], P1 receives [2] and [4], and the combined result is 1,3,2,4)
    - So, we have to choose some pivots that ALL the processes use to guarantee this, so let's try again:
        - Sort locally
        - Pick p-1 random samples globally, a la quicksort, with an O(p*log(p)) communication cost
        - Split the local sequence into "p" segments according to these samples
        - All-to-all communication
        - Locally merge the "p" sequences
    - So, in the best case, this'll perform just like quick-sort!
        - However, choosing the samples randomly means an unlucky process can end up receiving O(n) of the elements, putting us back at O(n) local work in the worst case - and with p processes drawing samples every round, SOMEBODY getting unlucky is actually kinda likely!
    - What can we do, then? We do SAMPLE SORT:
        - Sort the arrays locally
        - Choose p-1 "good" splitters
            - By "good splitters," we mean pivots that:

        - Split the local sequences into "p" segments using the splitters
        - (...)
    - How can we find these good splitters? Like this (there's a code sketch right after this list):
        - Pick p-1 equally-spaced elements from the local sorted array on each processor
        - Sort all p*(p-1) local splitters using bitonic sort
        - Pick p-1 global splitters: after the bitonic sort, each processor holds p-1 of the sorted samples, so we take the LAST sample on each of the first p-1 processors (i.e. every (p-1)-th sample overall)
        - Allgather the global splitters to each process
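        - (To make the whole pipeline concrete, here's a small sequential simulation of sample sort - the names are my own, lists stand in for processes, a plain sort() stands in for the bitonic sort, and loops stand in for the all-to-all:)

                import bisect, random

                def sample_sort_sim(data, p):
                    n = len(data)
                    # Step 0-1: deal the input out to p "processes", each sorts locally
                    local = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
                    # Step 2: each process picks p-1 equally-spaced local samples
                    samples = sorted(c[(len(c) // p) * (j + 1) - 1]
                                     for c in local for j in range(p - 1))
                    # Step 3: keep every (p-1)-th sorted sample as a global splitter
                    splitters = [samples[(p - 1) * (j + 1) - 1] for j in range(p - 1)]
                    # Step 4: split each local run at the splitters and "all-to-all"
                    buckets = [[] for _ in range(p)]
                    for c in local:
                        cuts = [0] + [bisect.bisect_right(c, s) for s in splitters] + [len(c)]
                        for j in range(p):
                            buckets[j].extend(c[cuts[j]:cuts[j + 1]])
                    assert max(len(b) for b in buckets) <= 2 * n // p  # the theorem below
                    # Step 5: each process sorts/merges what it received
                    return [x for b in buckets for x in sorted(b)]

                xs = [random.randrange(1000) for _ in range(64)]
                assert sample_sort_sim(xs, 4) == sorted(xs)

            - (Note the assert in step 4: it's an empirical check of the theorem below, that no simulated "process" ever receives more than 2*n/p elements)
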
    - Based on this, we have a THEOREM: Each processor is GUARANTEED to receive at most 2*n/p elements if we choose pivots via this method
        - Proof sketch: consider the elements received by some fixed processor Pj - i.e. everything that falls between two consecutive global splitters
        - Since the global splitters are every (p-1)-th sorted sample, at most p-1 local splitters fall among the elements Pj receives
        - Let "s_i" be the number of those local splitters that came from process "Pi"
        - Now, Pi's local splitters are equally spaced n/p^2 elements apart, so the elements Pi sends to Pj span at most s_i + 1 of those gaps - i.e. fewer than (s_i + 1) * n/p^2 elements
            - Therefore, the total number of elements received by Pj is:

                    < sum{i=0 to p-1, (s_i + 1) * n/p^2}
                    = (sum{i=0 to p-1, s_i}) * n/p^2 + sum{i=0 to p-1, n/p^2}
                    <= (p-1) * n/p^2 + p * n/p^2
                    < 2n/p

    - Okay; given these steps, we know bitonic sort on the ~p^2 samples will take O(p*log^2(p)) computation with log^2(p) communication rounds (each with message size ~p, since each process holds p-1 of the samples), and the allgather will take log(p) communication rounds
    - In total, then, we've got a runtime of:

            Comp: O(n/p * log(n/p) + n/p * log(p) + p*log^2(p))
                = O(n*log(n)/p + p*log^2(p))
            Comm: O(tp + un/p + (t+up)log^2(p)) = O(tp + u(n/p + p*log^2(p)))

        - So, the runtime of this algorithm is gonna be based on the ratio of "n" to "p" - and, as it turns out, this'll be optimal so long as:

                n > p^2 * log^2(p)

    - In other words, this sort is optimal for LARGE arrays - cool!
        - The exact size may also vary depending on the machine characteristics; if u*p*log^2(p) is smaller than tp, then this might not be the best algorithm to use
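        - (For a feel of how big "LARGE" is, here's the bound evaluated at a few made-up processor counts:)

                import math

                # minimum n (roughly) for sample sort to be optimal, per the bound above
                for p in (16, 64, 256, 1024):
                    print(f"p = {p:4d}: need n > {p**2 * math.log2(p)**2:,.0f}")

                # p =   16: need n > 4,096
                # p =   64: need n > 147,456
                # p =  256: need n > 4,194,304
                # p = 1024: need n > 104,857,600

        - (So even at p = 1024, an input in the hundreds of millions of elements keeps us comfortably in the optimal regime)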

- Alright, see you all next week!