//****************************************************************************//
//******************** First Day - January 7th, 2020 ************************//
//**************************************************************************//

- Professor Vuduc is here! I have no idea why, since he isn't the listed professor, but I'll take it!
    - ...possibly because Toby Isaac is a first-time professor here?
- "And now, since you've signed up for this class, here's some targeted advertising from Professor Vuduc!"
    - Apparently Professor Vuduc is building an undergraduate team to compete in the Supercomputing Cluster Competition, where students compete to build their own cluster and run it through a bunch of crazy tests (cutting power mid-run, power limits, etc.)
    - This WILL count as a VIP if you have time, so yay!
    - "I do NOT assume you have supercomputing experience for this VIP; instead, we'll teach you what you need to know as long as you have a core programming base"
- ALSO: there's an ungraded self-assessment quiz on Canvas. Please do it! If you're struggling, I'd recommend reading up on your algorithms notes

--------------------------------------------------------------------------------

- Alright, logistics!
- First up, office hours!
    - Professor Isaac has office hours from 1-2pm on Tuesday, in CODA S1323
    - Jinwoo Go is our TA, and he'll have office hours Wednesday 1-2pm in Klaus 1361 (the common area)
- Prerequisites-wise, I assume you've taken some undergraduate class in algorithm analysis; "we won't do anything crazy, but you will have to analyze runtimes"
- There are NO required textbooks - "computing moves too fast, so what's the point?"
    - Instead, all readings we expect you to do will be posted online
- Grading-wise, the breakdown is:
    - 20% homeworks
    - 20% programming assignments (using MPI, a parallel programming library)
    - 15% for EACH of our 2 midterms
    - 30% for the CUMULATIVE final exam
- Homework-wise, you are allowed to discuss ideas with other people, but you can't share source code
    - For other homeworks, you ARE allowed to work with up to 2 other people, but you have to name them as collaborators on your homework - and if they don't name YOU, there'll be issues
- Alright, so let's start off with a question: what IS parallel computing?
- Let's assume we have our super-awesome (and awesomely-named) algorithm, "f(x)"
    - To run this algorithm, our computer needs to consume resources, like time, energy, and money - but it already works!
    - The fundamental problem of high-performance computing is to ask "okay, this algorithm works, but is it GOOD?"
    - Annoyingly, in the real world, the answer depends on what computer we're running it on; O(n^2) might be perfectly fine on our super-powerful desktop, but be WAY too slow on an embedded system
        - In your previous algorithms class, you probably worked in theory-land and ignored those details, pretending everything ran on a single-threaded, ideal Turing machine
    - For our purposes here, "good" means that this algorithm not only works, but completes in a reasonable amount of time, doesn't use more memory than we have, and consumes a reasonable amount of power
- In this class, then, we'll assume we're working with a computer that has PARALLELISM, i.e. it can run multiple operations at the same time
- Multi-core processors have become the norm these days, since we had a problem: as we made single cores faster, they consumed increasingly more power and produced increasingly more heat, eventually making heat a HUGE problem
    - So, instead, people decided to put MORE cores on a single processor, rather than just making the individual cores faster
- So, we've got these parallel computers - what can we do with them?
    - First off, we have to develop parallel algorithms that can take advantage of these multiple processors!
        - Oftentimes, we'll need to model these algorithms slightly differently for particular architectures, so we'll need abstract machine models like PRAM
    - Then, we have to PROGRAM this parallel algorithm using some language or library (e.g. MPI, UPC, OpenMP, CUDA, etc.), turning the algorithm into code, then compiling it into machine instructions
    - We may also have to figure out how to build machines that can run our algorithms to their maximum capacity, optimized for parallel workloads and whatnot
    - Finally, we have to assess the performance of our algorithm, both by itself in theory-land as well as on our specific machine
- So, how can we do all this? That'll be the focus of the rest of this course!
    - We'll focus on algorithm analysis, then dive into the specifics of parallel algorithms, talk a bit about modeling for distributed networks and embedding, discuss general parallel techniques, and get into some of the specifics of implementing all this (using MPI in our case)
    - (missed some of the course goals???)
- With that, let's take a brief peek at the kind of computer we're building these algorithms for
    - Our computer has a processor that does operations and math and things, and memory that holds all the data we want to work with; these two are connected together
    - Already, there are a few things we'd like to know about this computer:
        - How fast is the processor, i.e. how many FLOPS can it do?
            - "FLOPS" is "floating-point operations per second," and defines how fast we can get work done
        - How much memory do we have?
        - What's the BANDWIDTH between the processor and memory (i.e. how fast can we move data between the two, in bytes/second)?
        - What're the LATENCIES we need to worry about, e.g. how long does it take from issuing a command to that command actually starting to execute, or between requesting a piece of memory and actually getting it? (see the rough cost-model sketch below)
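- A quick sketch (mine, not from lecture) of how these numbers get used: a common first-order cost model says moving n bytes takes roughly T(n) = latency + n/bandwidth. The little C program below just plugs in hypothetical, made-up values for the latency (alpha) and bandwidth (beta) - they're NOT measurements of any real machine - to show that small transfers are dominated by latency while big transfers are dominated by bandwidth:

      #include <stdio.h>

      /* Rough first-order cost model: time to move n bytes = alpha + n / beta.
       * alpha = latency (seconds), beta = bandwidth (bytes/second).
       * The constants below are hypothetical placeholders, not measurements. */
      int main(void) {
          const double alpha = 100e-9;   /* 100 ns latency (made up) */
          const double beta  = 10e9;     /* 10 GB/s bandwidth (made up) */
          const long sizes[] = {8, 1024, 1024 * 1024, 64L * 1024 * 1024};

          for (int i = 0; i < 4; i++) {
              long n = sizes[i];
              double t = alpha + (double)n / beta;
              printf("%10ld bytes: %.3e s (latency share: %.1f%%)\n",
                     n, t, 100.0 * alpha / t);
          }
          return 0;
      }

    - For the 8-byte transfer the latency term is essentially the whole cost, while for the 64 MB transfer the n/beta term dominates - which is exactly why bandwidth and latency have to be tracked as separate quantities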
- Bandwidth is NOT the same thing as latency - and thanks to CACHING, recently-accessed data (or nearby data, since we retrieve memory in blocks) is faster to access than a normal memory access
    - This means we ALSO have separate bandwidths/latencies between the cache and the processor
    - We also now need to worry about our CACHE SIZE
- This is a bit of a cartoon of an actual computer, but even at this stage, we can get some basic parallelism using instruction-level parallelism and some clever caching tricks
- If we wanted to make it REALLY parallel, though, we can add ANOTHER processor/cache and connect it to our memory/RAM
    - If we have a REALLY big system, we might also have a completely SEPARATE memory/cache/processor system, and connect the 2 separate pieces of RAM together
    - We might also have a GPU, which has a BUNCH of tiny processors on it; "you can think of the GPU as taking slow bites, but biting off HUGE chunks of the problem every time it does something"
    - Finally, for even BIGGER problems, we might have to connect totally separate systems with ALL these parts over a network
        - This means we have to deal with network latency/bandwidth
        - It also means we have to deal with the problems of having potentially totally different architectures/operating systems, different memory addressing schemes across the 2+ computers, etc.
- "...so, we've made our idea of a computer definitely more complicated. Don't worry if you don't get all of this right now; we'll revisit this kind of rough architecture idea many times"
- So, now that we have a more sophisticated idea of a computer, let's try to add some nuance to our idea of an algorithm, too
    - We can think of an algorithm as a DAG (directed acyclic graph), where we start with our inputs, flow through some operations, and eventually branch off to one of many potential outputs
    - With a parallel computer, though, our DAG doesn't have to flow through a single processor - we can split the pieces of our DAG across all the computer's components
    - "Mathematically, that's basically what we're doing this semester: taking an algorithm and trying to map the pieces of its DAG to different pieces of computer hardware"
    - Note that each edge from one operation to the next isn't free, but takes some time/resources to "traverse"
- Now, as for physical computers...
    - There's a rough hierarchy of them from least-to-most powerful:
        - Embedded devices
        - Personal Mobile Devices
        - (...)
        - Superclusters
    - As we've discussed, individual cores aren't getting all that much more powerful now, but there's still a version of Moore's Law at work as we pack more processors onto the same chip
    - Bandwidth (or "throughput") has gotten MUCH better, while latency has only seen marginal improvements by comparison
    - When the computer is running, every transistor switch takes some amount of energy, and power increases as we switch more frequently
        - Slowing down the processor's clock speed reduces how much power we're using, but NOT the total amount of energy we need to complete some task (the work takes proportionally longer, so energy ~ power x time stays roughly the same)
- So, the current trends in computer architecture are that:
    - We can't squeeze much more out of INSTRUCTION-LEVEL PARALLELISM (ILP) - where a single core automatically overlaps independent instructions via pipelining, superscalar issue, and out-of-order execution - so single-processor performance has started plateauing
    - INSTEAD, we need to leverage parallelism at the data, thread, and request level, which unfortunately means we need to restructure a lot of our algorithms and existing programs to take advantage of this (see the small thread-parallelism sketch below)
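- To make "parallelism at the data/thread level" concrete, here's a tiny sketch of mine (not from lecture) using OpenMP, one of the libraries mentioned above. The example itself (a midpoint-rule estimate of pi) is arbitrary; the point is that the loop's iterations are independent, so one pragma lets the runtime split them across threads:

      #include <omp.h>
      #include <stdio.h>

      /* Minimal sketch of data/thread-level parallelism with OpenMP.
       * Each loop iteration is independent, so iterations are divided among
       * threads, and the partial sums are combined with a reduction. */
      int main(void) {
          const int n = 1000000;
          double sum = 0.0;

          #pragma omp parallel for reduction(+:sum)
          for (int i = 0; i < n; i++) {
              double x = (i + 0.5) / n;       /* midpoint of the i-th subinterval */
              sum += 4.0 / (1.0 + x * x);     /* integrand of 4/(1+x^2), whose integral on [0,1] is pi */
          }

          printf("pi ~= %.12f (up to %d threads)\n", sum / n, omp_get_max_threads());
          return 0;
      }

    - Compile with something like gcc -fopenmp; because the iterations don't depend on each other, almost no restructuring is needed here - most real algorithms take a lot more work than this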
- There are 7 main types of parallel problems we'll need to worry about (although there are other, weirder edge cases):
    - DENSE LINEAR ALGEBRA problems are where we're dealing with large, dense blocks of data, and trying to solve systems of related equations
    - SPARSE LINEAR ALGEBRA / GRAPH OPERATIONS are where we're dealing with less structured data, and need to worry more about branching and caching
    - STRUCTURED GRIDS (or STENCILS) are between the two, and are where we're dealing with things like Laplacian-style updates (e.g. using neighboring grid points) that have some regularity
        - This is nicer, since we know exactly where the data is, but we can't take advantage of the parallelism and linear algebra machinery like we can with dense problems
    - SPECTRAL METHODS (e.g. the Fast Fourier Transform)
    - PARTICLE METHODS are unstructured and dynamic (the particles move every step, so which particles interact with which keeps changing, and so does the data access pattern)
    - MONTE CARLO / EMBARRASSINGLY PARALLEL / MAP-REDUCE problems are where we have a LOT of work to do, but none of it is related to or dependent on the rest until the very end (see the sketch at the end of these notes)
        - This means we don't need to do a whole lot of work to get stuff to run in parallel
- ...we'll ignore the ILP slides for now, since processors nowadays largely handle that by themselves, and you don't really need to deal with that stuff unless you're designing a processor
- Alright; next class, we'll pick up with a simplified model of a computer, talking about how to start splitting up work, what "good" means for this class's purposes, and so forth. Stay tuned!
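- (Sketch referenced from the MONTE CARLO / EMBARRASSINGLY PARALLEL bullet above - my own illustration, not course material: a Monte Carlo estimate of pi in MPI, the library the programming assignments will use. The sample counts and seeds are arbitrary; the point is that each rank works completely independently, and the ONLY communication is the single reduction at the very end)

      /* Embarrassingly parallel Monte Carlo sketch: each rank throws darts at
       * the unit square on its own; ranks only talk during the final MPI_Reduce. */
      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);

          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          const long trials_per_rank = 1000000;          /* arbitrary sample count */
          long local_hits = 0;
          unsigned int seed = 12345u + (unsigned int)rank;  /* different stream per rank */

          for (long i = 0; i < trials_per_rank; i++) {
              double x = (double)rand_r(&seed) / RAND_MAX;
              double y = (double)rand_r(&seed) / RAND_MAX;
              if (x * x + y * y <= 1.0) local_hits++;     /* inside the quarter circle? */
          }

          long total_hits = 0;
          MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

          if (rank == 0) {
              double pi = 4.0 * (double)total_hits / ((double)trials_per_rank * size);
              printf("pi ~= %f from %ld samples on %d ranks\n",
                     pi, trials_per_rank * size, size);
          }

          MPI_Finalize();
          return 0;
      }

    - Build and run with something like mpicc / mpirun -np 4; adding more ranks just adds more independent work, which is why this class of problem parallelizes so easily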