//****************************************************************************//
//******* Future Event List / Intro. to Parallel - March 28th, 2019 *********//
//**************************************************************************//

- Alright, you should be working hard on your checkpoints, so once again, we're expecting:
    - Your problem statement / conceptual model
    - We do NOT need the empirical distribution/input analysis done yet
        - "You should've started on it, and you'll need it for your final project, but it's okay if your simulations for the checkpoint are using placeholder data"
    - A "working version" of your simulator
        - It should be as BASIC as you can make it right now for the checkpoint, and then after that's passed you should work to make it as accurate and refined as possible
------------------------------------------------------
- Alright, we've covered most of the "modeling lifecycle," and we're starting to get into implementation issues and parallel processing - but let's start off by revisiting the future event list
    - As you know, the future event list is just a priority queue, but there's been a LOT of research into different ways to actually implement it
- Again, a future event list holds all the events we're waiting to process, and because it holds almost every event we care about, it has a HUGE impact on performance
    - If the performance of this list is good, we've usually got a fast simulation!
    - Per-event computation time usually isn't very high, but having to check and search through this list CONSTANTLY, for every event we process, adds up
- So, what operations do we NEED for this priority queue? (there's a rough sketch of these a little further down)
    - Some way of inserting events with a given timestamp, some way of deleting the smallest-timestamp event, and a way of deleting arbitrary events (to handle interrupts, or unexpected situations where scheduled events are now invalid)
    - If we're making a general-purpose event list, we also have a few extra constraints:
        - It can handle an arbitrarily large number of elements
        - It can efficiently handle cases where we don't know the order of insert/delete operations, or the distribution of timestamps (do we need to worry about microsecond-scale differences? Months-long differences? Both at once?)
- These requirements SCREAM linked list as an obvious data structure choice, right? But is it obviously the best?
    - Tree-based structures, like heaps and splay trees, are also pretty common - having a worst-case access time of O(log n) is MUCH better than O(n) when we're inserting a new event
    - In particular, some priority queues based on hashing can give us amortized constant-time performance, which is great!
- To evaluate the performance of a particular FEL, we'll usually use the HOLD model to do empirical performance evaluation
    - This is based on the "hold" primitive (or "AdvanceTime" - they're the same thing), where for a given fixed queue size we'll schedule a new event and then delete an existing one (enqueue + dequeue time), holding the number of elements constant
    - Using this model, we can see that linear linked lists are the fastest for small lists (< 10 elements), but their cost EXPLODES if you have a large event list
        - At large sizes, then, we need more efficient data structures, like heaps, splay trees, etc.
    - Further complicating this, our exact results will vary a bit between machine architectures (some data structures benefit significantly from caching, for instance)
- Hullo, did we say "constant time" somewhere? That's exciting! I WANT, HOW GET?
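- (Quick aside before we get to that: here's a rough, hypothetical sketch of the required operations and the "hold" primitive, using Python's built-in heapq as the O(log n) heap - the class and function names are made up for illustration, not from the lecture or any particular library, and events are assumed to be hashable)

      import heapq
      import itertools
      import random

      class FutureEventList:
          # Minimal heap-based FEL sketch: insert, remove-min, and lazy cancellation
          def __init__(self):
              self._heap = []                 # entries are (timestamp, tie-breaker, event)
              self._tie = itertools.count()   # breaks ties between equal timestamps
              self._cancelled = set()         # events that were deleted "arbitrarily"

          def insert(self, timestamp, event):
              heapq.heappush(self._heap, (timestamp, next(self._tie), event))

          def cancel(self, event):
              # "delete an arbitrary event": mark it now, skip it when it surfaces
              self._cancelled.add(event)

          def remove_min(self):
              # "delete the smallest-timestamp event", skipping cancelled entries
              while self._heap:
                  timestamp, _, event = heapq.heappop(self._heap)
                  if event in self._cancelled:
                      self._cancelled.discard(event)
                      continue
                  return timestamp, event
              raise IndexError("future event list is empty")

      def hold(fel, mean_delta=1.0):
          # One step of the HOLD benchmark: dequeue the next event, then enqueue a
          # replacement, so the queue size stays constant (mean_delta is an assumed
          # parameter for the timestamp-increment distribution)
          timestamp, event = fel.remove_min()
          fel.insert(timestamp + random.expovariate(1.0 / mean_delta), event)
          return timestamp

- Timing lots of hold() calls at each fixed queue size is essentially what the HOLD-model comparisons above are doing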
- Well, we get that constant-time behavior by using hashing-style methods - and for priority queues, that means CALENDAR QUEUES
    - The idea behind calendar queues is that we'll have an array of buckets, each covering some discrete interval of time (e.g. 365 buckets, one for each day of the year)
    - Each "bucket" will have a simple singly-linked list queue to hold all the events that fall inside that bucket (sorted or otherwise)
    - We'll then have a pointer to the current bucket/day, so that we can get to it right away
    - ...and that's it!
- Insertion is pretty easy: take the timestamp, map it to the appropriate bucket, and throw it into that bucket's queue
    - If an event "wraps around" to a bucket but is for a later period (e.g. Thursday of next year), we still put it in that same bucket
- Deleting, though, is a little trickier:
    - We'll start by looking in today's bucket; if none of the events there fall within the current period, we'll go to the next bucket, then the next, and so on until we find what we're looking for
- A few problems come to mind with this, though:
    - Within a bucket, we still might get a long list, which'll degrade to singly-linked-list performance
    - When deleting, we also might need to search through a LOT of buckets, particularly empty ones
    - To avoid this, if the number of events in the queue gets too large, we'll occasionally resize the calendar so that there are only 1-2 events per bucket
        - To do this, we literally just make a larger event calendar and then copy all of our existing events over to the new one - it's expensive!
- How's the performance of this calendar queue? Well, for a small number of events, linear lists still win - but for large queues, the HOLD model shows it massively outperforming other data structures
    - ...but remember something important about the HOLD model? The list size is constant! That means the queue never has to do an expensive resizing operation!
    - As it turns out, the calendar queue performs VERY well when the number of events and the average gap between timestamps don't change very much, but it can approach O(n) when there's a huge influx of events, or a lot of events with very similar timestamps - which is AWFUL! It's as bad as a linear linked list!
- Because of those problems, an alternative exists, known as the LADDER QUEUE
    - This is a hybrid data structure consisting of 3 layers:
        - A TOP layer, which holds an unsorted list of events in the "far future" that we're not too worried about
            - Here, we'll just store the max/min timestamps and the number of events
        - A MIDDLE layer, which is a kind of multi-level calendar queue, and where the "ladder" name comes from
            - How does that work? Well, as we mentioned, the problems with calendar queues mostly come from a lot of events falling into a single bucket
            - To fix this, if too many events fall into a single bucket, we'll spawn a new calendar queue JUST to hold that bucket's events!
            - And if a bucket in THAT queue fills up too much, then it'll recursively split into a new "rung" of the ladder as well!
        - A BOTTOM layer, which is a sorted list holding the "near future" events we're most worried about
            - This is the layer we'll dequeue from most often
    - So, the big ideas behind this data structure are that:
        - Each "rung" of our middle layer has a different bucket size
        - We try to delay sorting until events are just about to be dequeued; insertions into the middle/top layers aren't sorted at all, which means resizing the calendars can't undo any sorting we've already done
- In summary for future event lists, then: priority queue performance is an important issue, ESPECIALLY for very large simulations with many millions of events
    - Linear lists are the fastest if things are small, O(log n) structures like heaps and splay trees are reliable, calendar queues are blazingly fast but with major downsides in some cases, and ladder queues are complicated to implement but can often give us the best of both worlds
- ...so, for the rest of the semester, we'll be talking about parallel event simulations - let's dive right in!
- PARALLEL SIMULATION basically means we're taking a single program, chopping it up into pieces, and executing it on multiple TIGHTLY COUPLED processors (e.g. different cores all sharing memory on a board)
- DISTRIBUTED SIMULATIONS still involve executing a single program, but we're now running it on LOOSELY coupled processors, e.g. over the internet (meaning processes can't respond to each other as quickly)
- There's a 3rd type, REPLICATED TRIALS, that we don't care about as much - it just means running the same program with different conditions on different computers, without any communication whatsoever
- Years ago, we cared about parallel processing because researchers had access to supercomputers and wanted to take advantage of that extra power to do things faster, or because their program needed more memory than a single computer had
    - Nowadays, though, parallel processing is EVERYWHERE: every laptop processor has multiple cores!
    - We might also care because we have resources that are distributed over a large space, and we need to coordinate those different computers (e.g. multiplayer videogames)
    - Interoperability is yet another concern, where we need to have multiple separate programs all running together in coordination (e.g. a military simulation that has to use "tankEngine.java"'s results to calculate the effect of "shootEnemy.py," and needs them both to run in sync)
    - If you need to deal with any of these problems, you should consider doing parallel processing!
- So, those are the definitions, but how did we get here in history?
- Ridiculously briefly: parallel processing started in the high-performance computing community in the early 1980s with the "Chandy-Misra-Bryant" and "Time Warp" algorithms
    - These are respectively known as "conservative" and "optimistic" synchronization algorithms, which we'll distinguish between later
- At the same time, the defense community was having trouble with many different systems that didn't work together, and in 1983 they started the SIMNET project to emphasize reusing old simulations for new purposes
    - They started pushing for standards to keep their simulations interconnected, and this continued into the 1990s
    - The military was also interested in having multiple computer simulations that they could hook together, generating yet more funding
    - Those defense initiatives eventually led to the more generalized IEEE standard "High Level Architecture," which was first published in 1996 and was last revised in 2010
- Finally, an important push for parallel simulation came from the multiplayer gaming community in the 1990s, which needed to figure out how to coordinate multiple computers playing the same game together across the globe
- What does a parallel discrete event simulation actually look like, then?
    - Well, for each core/computer we're running the simulation on, we'll have a LOGICAL PROCESS (LP) that can schedule events for other LPs (which requires sending messages to those other logical processes)
    - More generally, think of it like this:
        - We have a "physical system" we want to model, composed of a bunch of independent, interacting physical processes
        - For each physical process, we'll create an LP - and when those physical processes need to interact/affect one another, their LPs will send messages and schedule events on each other!
    - For instance, think about our airport example, but this time with MULTIPLE airports - we'll have a process for each airport! (there's a rough sketch of this a bit further down)
    - All information shared between different processes HAS to go through message passing (no shared state allowed, or the processes aren't independent!)
- What do the code changes look like for this? Not much!
    - The ONLY change to the airport code we need to make is that when we're processing a departure, we'll schedule the arrival event at a DIFFERENT airport instead, and the time to get there will be based on the airports we're flying to/from
- So, that sounds really easy! But what's the catch?
    - At first, it seems like this LP approach handles concurrent events really naturally - we're just turning each physical process into a logical process, and each physical interaction into an event/message
    - But back when we had everything on a single processor, we had a guarantee of the all-important LOCAL CAUSALITY CONSTRAINT: our single future event list meant we ALWAYS processed events in timestamp order!
    - The problem NOW is that when we begin to process an event, there's a chance that another process will add a new event that should happen BEFORE that one starts
        - e.g. What if we start processing the next scheduled airplane landing at 10:00pm, but then Much Closer International Airport schedules a landing that should've happened at 9:15? TIME WOULD BE BROKEN, and physics would be sad :(
        - If our simulation had other details, this would cascade into passengers missing layovers, nonsensical trip orderings, etc.
- This is known as the SYNCHRONIZATION problem, and it's a big one: how can we ensure that all events are processed in the right order across ALL processes?
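- (As mentioned above, here's a rough, hypothetical sketch of the two-airport LP idea, just to make the message-passing structure concrete - AirportLP, FLIGHT_TIME, GROUND_TIME, and the naive round-robin driver are all illustrative choices, not the course's actual code)

      import heapq
      import itertools

      FLIGHT_TIME = 3.0   # assumed flight time between the two airports
      GROUND_TIME = 1.0   # assumed turnaround time on the ground

      class AirportLP:
          def __init__(self, name):
              self.name = name
              self.fel = []                   # this LP's private future event list
              self.tie = itertools.count()    # tie-breaker for equal timestamps
              self.inbox = []                 # timestamped messages from the other LP
              self.neighbor = None

          def schedule(self, ts, handler):
              heapq.heappush(self.fel, (ts, next(self.tie), handler))

          def arrival(self, ts):
              print(f"t={ts:4.1f}  plane lands at {self.name}")
              self.schedule(ts + GROUND_TIME, self.departure)

          def departure(self, ts):
              print(f"t={ts:4.1f}  plane departs {self.name}")
              # The only real change from the sequential simulator: the arrival is
              # scheduled at a DIFFERENT airport, by sending it a message
              self.neighbor.inbox.append((ts + FLIGHT_TIME, self.neighbor.arrival))

          def process_next(self):
              # Naively merge any received messages, then pop the locally-smallest event
              for ts, handler in self.inbox:
                  self.schedule(ts, handler)
              self.inbox.clear()
              if self.fel:
                  ts, _, handler = heapq.heappop(self.fel)
                  handler(ts)

      # Tiny usage sketch: one plane bouncing between two hypothetical airports
      atl, ord_ = AirportLP("ATL"), AirportLP("ORD")
      atl.neighbor, ord_.neighbor = ord_, atl
      atl.schedule(0.0, atl.arrival)
      for _ in range(3):
          atl.process_next()
          ord_.process_next()

- Notice that nothing in process_next() stops an LP from charging ahead and then receiving a message that belongs in its past - that's exactly the violation the synchronization algorithms below have to prevent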
- If we can ensure that all of these events stay synchronized, we're good - a theorem guarantees that this is sufficient for the output to be identical to a sequential program's
- We need an algorithm to keep things in sync, then, and there are 2 broad categories of algorithms to do that:
    - CONSERVATIVE ALGORITHMS never violate the LCC: they don't process an event until we know it's safe, and try to avoid making ANY mistakes at all
        - The problem here is figuring out how to avoid deadlock/livelock, detect it in advance, etc.
    - OPTIMISTIC ALGORITHMS allow these violations to occur, but then detect and correct them as the simulation runs
        - The problem here is figuring out WHEN an error has occurred, and how to reverse it
- One of the oldest, most famous conservative algorithms is the CHANDY-MISRA-BRYANT (CMB) ALGORITHM (or "null message algorithm")
    - To use this algorithm, we need to make 4 assumptions first:
        - All the logical processes are exchanging messages with timestamps
        - The network topology isn't changing (i.e. no logical processes are being created or deleted)
        - Messages on each link are sent in timestamp order (i.e. any message a given process sends later will really have a later timestamp)
        - The network delivers messages reliably and in the order they were sent
    - Based on those assumptions, we know that any future message on a given link is GUARANTEED to have a timestamp at least as large as the last message received on that link
        - How does that help us? Because it means the lowest-timestamp message waiting at the head of all our incoming links is the lowest-timestamp message we'll EVER have to deal with - so it's safe to process!
    - So, the simplest way we can use this is an algorithm like this:

        while the LP is running:
            wait until each incoming FIFO contains at least one message
            remove the smallest time-stamped message across all the FIFOs and process it

    - This looks good, right? But the problem is that it can lead to DEADLOCK: two separate LPs can each be waiting to receive a message from the other, which stops the whole system!
    - To get around this, the CMB algorithm has each LP send a NULL MESSAGE announcing the minimum time that must pass before it could send a new message
        - So, let's say that we're at time 0 and are listening to 2 processes. One process gives us a null message saying it won't send any messages for at least the next 3 seconds
        - What does that mean? Since 0+3 = 3, it means that ANY message we get from the other process with a timestamp less than 3 is safe to process! (there's a rough sketch of this rule at the very end of these notes)
- What problems could arise from this, though? And is there any way to fix them? We'll keep talking about THAT - and much, much more - next week!
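- (As promised above, here's a rough, hypothetical sketch of the "safe to process" bookkeeping behind the null-message idea - ConservativeLP, LOOKAHEAD, and the link names are made up for illustration, and this is only the rule from the example, not a complete CMB implementation)

      import heapq
      import itertools

      LOOKAHEAD = 3.0   # assumed minimum delay before this LP can affect a neighbor

      class ConservativeLP:
          def __init__(self, input_links):
              self.clock = 0.0
              self.pending = []               # (timestamp, tie-breaker, event) heap
              self._tie = itertools.count()
              # Per-link "channel clock": timestamp of the last (real or null)
              # message received on that link
              self.link_clock = {link: 0.0 for link in input_links}

          def receive(self, link, timestamp, event=None):
              # event=None means a null message: "I will send nothing on this
              # link with a timestamp earlier than `timestamp`"
              self.link_clock[link] = timestamp
              if event is not None:
                  heapq.heappush(self.pending, (timestamp, next(self._tie), event))

          def safe_time(self):
              # No link can still deliver anything earlier than its channel clock,
              # so anything at or below the smallest channel clock is safe
              return min(self.link_clock.values())

          def process_safe_events(self, handle):
              while self.pending and self.pending[0][0] <= self.safe_time():
                  timestamp, _, event = heapq.heappop(self.pending)
                  self.clock = timestamp
                  handle(timestamp, event)

          def null_message_timestamp(self):
              # What this LP promises its neighbors: it will send nothing earlier
              # than its current clock plus its lookahead
              return self.clock + LOOKAHEAD

      # The example from above: at time 0, one neighbor's null message promises
      # nothing before t=3, so a real message from the other neighbor at t=2 is
      # safe to process right away
      lp = ConservativeLP(input_links=["A", "B"])
      lp.receive("A", 3.0)                   # null message from A
      lp.receive("B", 2.0, event="landing")  # real message from B
      lp.process_safe_events(lambda t, e: print(f"processed {e!r} at t={t}"))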