//****************************************************************************//
//******************** First Day - January 7th, 2020 ************************//
//**************************************************************************//
- Professor Vuduc is here! I have no idea why, since he isn't the listed professor, but I'll take it!
- ...possibly because Toby Isaac is a first-time professor here?
- "And now, since you've signed up for this class, here's some targeted advertising from Professor Vuduc!"
- Apparently Professor Vuduc is building an undergraduate team to compete in the Supercomputing Cluster Competition, where students compete to build their own cluster and run it with a bunch of crazy tests (cutting power mid-run, power limits, etc.)
- This WILL count as a VIP if you have time, so yay!
- "I do NOT assume you have supercomputing experience for this VIP; instead, we'll teach you what you need to know as long as you have a core programming base"
- ALSO: there's an ungraded self-assessment quiz on Canvas. Please do it! If you're struggling, I'd recommend reading up on your algorithm notes
--------------------------------------------------------------------------------
- Alright, logistics!
- First up, office hours!
- Professor Isaac has office hours from 1-2pm on Tuesday, in CODA S1323
- Jinwoo Go is our TA, and'll have office hours Wednesday 1-2pm in Klaus 1361 (the common area)
- Prerequisites-wise, I assume you've taken some undergraduate class in algorithms analysis; "we won't do anything crazy, but you will have to analyze runtimes"
- There are NO required textbooks - "computing moves too fast, so what's the point?"
- Instead, all readings we expect you to do will be posted online
- Grading wise, the breakdown is:
- 20% homeworks
- 20% programming assignments (using MPI, a parallel library)
- 15% for EACH of our 2 midterms
- 30% for the CUMULATIVE final exam
- Homework-wise, you are allowed to discuss ideas with other people, but can't share source code
- For other homeworks, you ARE allowed to work with up to 2 other people, but have to name them as such on your homework - and if they don't name YOU, there'll be issues
- Alright, so let's start off with a question: what IS parallel computing?
- Let's assume we have our super-awesome (and awesomely-named) algorithm, "f(x)"
- To run this algorithm, our computer needs to consume resources, like time, energy, and money - but it already works!
- The fundamental problem of high-performance computing is to ask "okay, this algorithm works, but is it GOOD?"
- Annoyingly, in the real world, this answer depends on what computer we're running something on; O(n^2) might be perfectly fine on our super-powerful desktop, but be WAY too slow on an embedded system
- In your previous algorithms class, you probably worked in theory-land and ignored those resource constraints, pretending everything ran on a single-threaded, idealized Turing machine
- For our purposes, here, "good" means that this algorithm not only works, but completes in a reasonable amount of time for our purposes, doesn't use more memory than we have, and consumes a reasonable amount of power
- In this class, then, we'll assume we're working with a computer that has PARALLELISM, i.e. it can run multiple operations at the same time
- Multi-core processors have become the norm these days because of a physical problem: as we made single cores faster, they drew more and more power and gave off more and more heat, until heat became a HUGE problem
- So, instead, people decided to put MORE cores on a single processor, rather than just making the individual cores faster
- So, we've got these parallel computers - what can we do with them?
- First off, we have to develop parallel algorithms that can take advantage of these multiple processors!
- Oftentimes, we'll need to model these algorithms slightly differently for particular architectures, so we'll need things like PRAM to represent these models
- Then, we have to PROGRAM this parallel algorithm using some language and library (e.g. MPI, UPC, OpenMP, CUDA, etc.), turning the algorithm into code, then compiling it into machine instructions (there's a small MPI sketch a few bullets below)
- We may also have to figure out how to build machines that can run our algorithms to their maximum capacity, optimized for parallel workloads and whatnot
- Finally, we have to assess the performance of our algorithm, both by itself in theory-land as well as on our specific machine
- So, how can we do all this? That'll be the focus of the rest of this course!
- We'll focus on algorithm analysis, then dive into the specifics of parallel algorithms, talk a bit about modeling for distributed networks and embedding, discuss general parallel techniques, and get into some of the specifics of implementing all this (using MPI in our case)
- (missed some of the course goals???)
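- A small aside (mine, not from the lecture): to make the "program it with MPI" step concrete, here's a minimal sketch of splitting work across processes and combining the results; the problem size and file/variable names are made up for illustration

      /* sum_mpi.c - each MPI process sums its own slice of 0..N-1, then the partial
         sums are combined onto rank 0 with MPI_Reduce.
         Typical build/run: mpicc sum_mpi.c -o sum_mpi && mpirun -np 4 ./sum_mpi */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);

          int rank, nprocs;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?       */
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many processes total? */

          const long N = 1000000;                  /* made-up problem size      */
          long lo = rank * (N / nprocs);           /* my slice of the work      */
          long hi = (rank == nprocs - 1) ? N : lo + N / nprocs;

          double local = 0.0, total = 0.0;
          for (long i = lo; i < hi; i++)
              local += (double)i;

          /* combine everyone's partial sum onto rank 0 */
          MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

          if (rank == 0)
              printf("sum of 0..%ld = %.0f (expected %.0f)\n",
                     N - 1, total, 0.5 * (double)N * (N - 1));

          MPI_Finalize();
          return 0;
      }

- The shape is the main point: every process runs the same program, works out its own chunk from its rank, and only communicates at the very end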
- With that, let's take a brief peek at the kind of computer we're building these algorithms for
- Our computer has a processor that does operations and math and things, and has memory/data that holds all the stuff we want to work with; these two are then connected together
- Already, there's a few things we'd like to know about this computer:
- How fast is the processor, i.e. what's its clock rate and, more usefully, its FLOP rate (FLOPS)?
- "FLOPS" is "floating-point operations per second," and defines how fast we can get work done
- How much memory do we have?
- What's the BANDWIDTH between the processor/memory (i.e. how fast can we move data between the 2, in bytes/second?)
- What're the LATENCIES we need to worry about, e.g. how long does it take from giving a command to that command actually starting to execute/finish, or between requesting a piece of memory and actually getting it?
- Bandwidth is NOT the same thing as latency, and CACHING complicates both: recently-fetched data (or nearby data, since we retrieve memory in blocks/cache lines) is faster to access than normal
- This means we ALSO have separate bandwidths/latencies between the cache/processor
- We also now need to worry about our CACHE SIZE
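- A back-of-the-envelope aside (mine, not from the lecture): here's why those numbers matter together - the machine numbers below are made up, but the punchline (a dot product is limited by memory bandwidth, not by the FLOP rate) is typical of real machines

      /* balance.c - is a dot product limited by the FLOP rate or by memory bandwidth?
         All machine numbers are assumptions for the sake of the estimate. */
      #include <stdio.h>

      int main(void) {
          double flops_per_sec = 100e9;   /* assumed: 100 GFLOP/s peak compute    */
          double bytes_per_sec = 20e9;    /* assumed: 20 GB/s memory bandwidth    */
          double latency_sec   = 100e-9;  /* assumed: ~100 ns to start a transfer */

          long   n     = 10000000;                  /* dot product of 2 length-n vectors  */
          double flops = 2.0 * n;                   /* one multiply + one add per element */
          double bytes = 2.0 * n * sizeof(double);  /* both vectors streamed from memory  */

          double t_compute = flops / flops_per_sec;
          double t_memory  = latency_sec + bytes / bytes_per_sec;

          printf("compute time ~ %.1e s, memory time ~ %.1e s\n", t_compute, t_memory);
          /* memory time (~8e-3 s) dwarfs compute time (~2e-4 s): the kernel is
             bandwidth-bound, so the FLOP rate alone says little about its real speed */
          return 0;
      }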
- This is a bit of a cartoon of an actual computer, but even at this stage we can get some basic parallelism using instruction-level parallelism and some clever caching tricks
- If we wanted to make it REALLY parallel, though, we can add ANOTHER processor/cache and connect it to our memory/RAM
- If we have a REALLY big system, we might also have a completely SEPARATE memory/caches/processors system, and connect the 2 separate pieces of RAM together
- We might also have a GPU, which has a BUNCH of tiny processors on it; "you can think of the GPU as taking slow bites, but biting off HUGE chunks of the problem every time it does something"
- Finally, for even BIGGER problems, we might have to connect totally separate systems with ALL these parts over a network
- This means we have to deal with network latency/bandwidth
- It also means we have to deal with the problems of having potentially totally different architectures/operating systems, different memory addressing schemes across the 2+ computers, etc.
- "...so, we've made our idea of a computer definitely more complicated. Don't worry if you don't get all of this right now; we'll revisit this kind of rough architecture idea many times"
- So, now that we have a more sophisticated idea of a computer, let's try to add some nuance to our idea of an algorithm, too
- We could think of an algorithm as a DAG (directed acyclic graph): we start with our inputs, flow through a series of operations (all on our one processor, in this simple picture), and eventually branch off to our outputs
- With a parallel processor, though, our DAG doesn't just have to flow to a single processor - we can split the pieces of our DAG across all the computer's components
- "Mathematically, that's basically what we're doing this semester; taking an algorithm and trying to map the pieces of its DAG to different pieces of computer hardware"
- Note that each of these edges from an operation to the next thing isn't free, but takes some time/resources to "traverse"
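- A tiny example of that DAG picture (mine, not from the lecture): summing 8 numbers as a binary tree - every addition within one level of the tree is independent, so with enough processors each level could run all at once

      /* tree_sum.c - sum n = 8 numbers as a tree-shaped DAG.  Each pass of the outer
         loop is one level of the DAG; the additions within a level are independent,
         so they could be mapped to different pieces of hardware. */
      #include <stdio.h>

      int main(void) {
          double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
          int n = 8;

          for (int stride = 1; stride < n; stride *= 2) {       /* log2(8) = 3 levels */
              for (int i = 0; i + stride < n; i += 2 * stride)  /* independent adds   */
                  x[i] += x[i + stride];
          }
          printf("sum = %g\n", x[0]);   /* prints 36 */
          return 0;
      }

- With n numbers the tree has log2(n) levels, so with enough processors the sum can finish in O(log n) steps instead of the O(n) steps of the usual left-to-right loop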
- Now, as for physical computers...
- There's a rough hierarchy of them from least-to-most powerful:
- Embedded devices
- Personal Mobile Devices
- (...)
- Superclusters
- As we've discussed, individual cores aren't getting all that more powerful now, but there's still a version of Moore's Law as we try to pack more processors on the same chips
- Bandwidth (or "throughput") has gotten MUCh faster, but somewhat at the cost of latency research (where we've only seen marginal improvements)
- When we're running the computer, every transistor switch takes some amount of energy, and power increases as we switch more frequently
- Slowing down the processor's clock speed reduces how much power we're using, but NOT the total amount of energy we need to complete some task
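- To make that power-vs-energy distinction concrete (my addition, not from the slides), here's the usual CMOS dynamic-power approximation:

      % Dynamic power of switching logic (capacitance C, supply voltage V, clock frequency f):
      \[ P_{\mathrm{dyn}} \approx C V^2 f \]
      % A task needing W clock cycles runs for T = W / f seconds, so its energy is
      \[ E = P_{\mathrm{dyn}} \, T \approx C V^2 f \cdot \frac{W}{f} = C V^2 W \]
      % i.e. lowering the clock f alone cuts power but not the energy for the task;
      % the real energy savings come from lowering the voltage V, which slower clocks allow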
- So, the current trends in computer architecture are that:
- We can't squeeze much more out of INSTRUCTION-LEVEL PARALLELISM (ILP), where a single core automatically overlaps independent instructions via pipelining, superscalar issue, and out-of-order execution, since single-processor performance has plateaued
- INSTEAD, we need to leverage parallelism at the data, thread, and request level, which unfortunately means we need to restructure a lot of our algorithms and existing programs to take advantage of this
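- A minimal sketch of what that restructuring looks like (mine, not from the lecture): an AXPY loop whose iterations are all independent, so instead of hoping the core's ILP machinery finds the overlap, we hand the loop to threads (and the compiler's vectorizer); the sizes and constants are made up

      /* axpy.c - y = a*x + y, with the parallelism stated explicitly.
         Typical build: gcc -O2 -fopenmp axpy.c -o axpy */
      #include <omp.h>
      #include <stdio.h>

      #define N 1000000

      static double x[N], y[N];

      int main(void) {
          for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

          double a = 3.0;
          /* data/thread-level parallelism: iterations are split across threads,
             and each thread's chunk can also be vectorized by the compiler */
          #pragma omp parallel for
          for (long i = 0; i < N; i++)
              y[i] = a * x[i] + y[i];

          printf("y[0] = %g (expected 5)\n", y[0]);
          return 0;
      }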
- There are 7 main types of parallel problems we'll need to worry about (although there are other, weirder edge cases):
- DENSE LINEAR ALGEBRA problems are where we're dealing with large blocks of data, and trying to solve systems of related equations
- SPARSE LINEAR ALGEBRA / GRAPH OPERATIONS are where we're dealing with less structured data, and need to worry more about branching and caching
- STRUCTURED GRIDS (or STENCILS) are between the two, and are where we're dealing with things like Laplacians (e.g. updating each grid point from its neighbors) that have some regularity
- This is nicer, since we know exactly where the data is, but we can't exploit the high-intensity dense-linear-algebra machinery like we can with dense problems
- SPECTRAL METHODS (e.g. the Fast Fourier Transform)
- PARTICLE METHODS are unstructured and dynamic (why? likely because the particles move every step, so which particles interact with which keeps changing)
- MONTE CARLO / EMBARRASSINGLY PARALLEL / MAP REDUCE problems are where we have a LOT of work to do, but none of it is related or dependent on the rest until the very end
- This means we don't need to do a whole lot of work to get stuff to run in parallel
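- A sketch of the embarrassingly-parallel case (mine, not from the lecture): a Monte Carlo estimate of pi in which every sample is independent, so the threads only coordinate once, at the final reduction; rand_r is POSIX, and the sample count is arbitrary

      /* pi_mc.c - embarrassingly parallel Monte Carlo estimate of pi.
         Typical build: gcc -O2 -fopenmp pi_mc.c -o pi_mc */
      #include <omp.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(void) {
          const long samples = 10000000;   /* made-up sample count */
          long hits = 0;

          #pragma omp parallel reduction(+:hits)
          {
              /* per-thread random seed so threads don't share state */
              unsigned int seed = 1234u + 17u * (unsigned int)omp_get_thread_num();
              #pragma omp for
              for (long i = 0; i < samples; i++) {
                  double px = rand_r(&seed) / (double)RAND_MAX;
                  double py = rand_r(&seed) / (double)RAND_MAX;
                  if (px * px + py * py <= 1.0)
                      hits++;              /* each sample is totally independent */
              }
          }
          printf("pi ~= %f\n", 4.0 * hits / (double)samples);
          return 0;
      }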
- ...we'll ignore the ILP slides for now, since processors nowadays largely handle those by themselves and you don't really need to deal with that stuff unless you're designing a processor
- Alright; next class, we'll pick up with a simplified model of a computer, talking about how to start splitting up work, what "good" means for this class's purposes, and so forth. Stay tuned!