Distributed Dragons: Why Distributed Algorithms Are Different
// August 29th, 2010 // Uncategorized
The Perfect Question
Anyone who has ever tried to teach rapidly learns that the chief obstacle to learning is not failing to understand answers, but failing to ask the right questions. The best students usually have an intuition for how questions dovetail together, a creativity in formulating them, and a general talent for finding the right way to distinguish subtle differences. Teachers often love these students, though their classmates might not appreciate them so much.
As a professional developer, I often find myself in the role of teacher. Not to suggest that my colleagues are incompetent. They are quite far from it. It’s just that nobody has knowledge of everything, and a singular few have the time to gain experience with certain technologies. While I love the process of working together with people to create something, I occasionally find myself quite appreciative of working with people who can ask just the right questions.
I recently found myself expositing on some of the more esoteric vagaries of distributed computing. I was pressed with a question so incisive that it served to distinguish a piece of knowledge that I understand so implicitly that I rarely think to point it out. The question was this:
Why is writing code and handling data in a distributed environment any different from writing threaded code? It’s just concurrency, right?
It turns out that it’s anything but that simple. However, the comparison is a really good way to understand it. Most programmers worth their salt are familiar with threading and the concurrency primitives used with threads. Unfortunately, this knowledge is often informal: the formalisms and rules we follow when working with threaded concurrency are rarely understood deeply.
As such, this question was a really good way to bring out the assumptions underlying concurrent programming in a single system; to illustrate why those assumptions don’t carry over to the distributed setting; and to provide context to illustrate the value (and limitations) of distributed consensus.
Quick Survey of Threading Assumptions and Primitives
The first programs that most people write are single-threaded. To wit, when run, the computer starts at some entry point, progresses through a sequence of instructions in some order, and may eventually terminate. As we eventually learned, the operating system (henceforth, OS) is itself a program that passes control to our program. For those of us that started in an older time, the OS often gave up control of the machine entirely (DOS, I’m looking at you). This made it clear that there was some sort of interface between the OS and our program.
In more modern times, the OS appeared to run simultaneously with our program. Knowing that our computers only had one processor led us to learn that the OS had mechanisms to temporarily switch between itself and different programs (and thus multitasking was born). Eventually it really did get multiple processors (and later multiple cores in a single processor). That’s the level of sophistication at which we find ourselves. Each level has stripped away a set of assumptions.
- Single-Threaded Program: No two things happen at once, everything proceeds in an order that we control.
- Multitasking OS: The OS may do things behind our back, multiple copies of our program may be running, and the program may occasionally be scheduled away. Things like file-locks start showing up.
- Multi-Threaded Program on a Single Core: Different threads may be scheduled away at certain times, so we may use in-memory locks, though certain types of contention aren’t seen since only one thread is really executing at once. Locks implemented in the OS (in software) may show up.
- Multi-Threaded Program on Multiple Cores: Now we can really muck things up, as two bits of code on different cores can simultaneously battle over data. Locks must be implemented in hardware (i.e. a multi-core consistent atomic-exchange or test-and-set instruction of some sort), generally with some interface through the OS. The increase in accessible concurrency leads to a focus on things like locking regimes and deadlock detection.
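As a concrete sketch of that last regime, here is a minimal Python example (my own illustration, not from any particular codebase) where a lock serializes the classic read-modify-write hazard between threads:

```python
import threading

# Without the lock, two threads can interleave the read-modify-write of
# `counter` and lose updates; threading.Lock (backed by an atomic OS/hardware
# primitive) serializes the critical section.

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:            # acquire/release wraps the critical section
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 -- guaranteed only because of the lock
```

The point is that the lock's guarantee ultimately bottoms out in a hardware atomic instruction exposed through the OS, exactly as described above.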
Inverting the above list, there are various assumptions about concurrency that we shed. Interestingly, most people assume that the issue of distributed computing has to do with concurrency with the addition of greater latency. It turns out that there is no more concurrency in the network model and latency is mostly a performance issue (not a data corruption issue).
Understanding the above is also hindered by two more factors. Firstly, many practitioners today never used a non-multitasking OS. Secondly, most people’s learning about concurrency issues comes in the form of ad-hoc “hack on it until it works” experimentation. The combination of the two usually results in a lot more superstition than information.
Without experiencing a deeply single-threaded environment, you don’t have the contrast to appreciate where parallelism is really happening. Without formal background, you don’t really have any hard guarantees when things do work. Together, these create a situation where programmers work less on hard data and more on perceptions of how risky certain things are.
The way they distinguish and weigh the risks of concurrent code is imprecise and often out of balance. You know that guy who labels every mysterious bug a “race condition”? Yeah, that’s the one. These people are not stupid, but learning from experience is not always possible without the right tools.
A Detour Into Logic
Before we dig more into issues unique to distributed computing, there is an incredibly useful concept to know. As anybody familiar with digital encoding will ramble about at length, Information Theory, Computing, and Thermodynamics are tied together in a pretty serious way. Specifically, information is the opposite of entropy, and the transfer of information behaves in a slippery way that is determined by deep mathematics. It’s difficult to express exactly what they’re getting at without a lot of math. However, we don’t need to prove it, or even really understand it. We just need tools to work with it.
As it turns out, there is a confluence between philosophy, logic, and circuit design. Specifically, Nuel Belnap developed something that has been called Relevance Logic (if you can find it, read “A useful four-valued logic”). The upshot of relevance logic is that you can do useful things with a system in which a logical proposition isn’t viewed as simply being True or False. Rather, a value can have one of the four states Definitely True, Definitely False, Conflicting Information, or No Information.
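To make the four values concrete, here is a small Python sketch (the names and the merge rule are my paraphrase of Belnap’s information-combining operation, not his notation):

```python
from enum import Enum

# Belnap's four values: what two sources, taken together, tell you about
# a proposition. `merge` combines two reports about the same proposition.

class B4(Enum):
    NO_INFO = "no information"
    TRUE = "definitely true"
    FALSE = "definitely false"
    CONFLICT = "conflicting information"

def merge(a: B4, b: B4) -> B4:
    """Combine two reports about the same proposition."""
    if a == b or b == B4.NO_INFO:
        return a
    if a == B4.NO_INFO:
        return b
    # Any disagreement between definite values, or an existing conflict,
    # yields Conflict.
    return B4.CONFLICT

print(merge(B4.NO_INFO, B4.TRUE))    # B4.TRUE
print(merge(B4.TRUE, B4.FALSE))      # B4.CONFLICT
```

Note that information only ever grows under `merge`: No Information can become a definite value, and disagreeing definite values become Conflict, never the reverse.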
If you squint, these concepts are suspiciously close to values used when designing circuits from boolean logic using Karnaugh Maps. In those maps, you have blank spots that may have any value (i.e. ‘don’t know’), and you have blank spots that don’t matter (i.e. ‘don’t care’). The former must be resolved in order to have a valid circuit (e.g. a buffer value can be provided to prevent “race conditions” where a value flips state in a way that is harmful), whereas the latter are of no consequence to coming up with a solution (and thus don’t matter in the design).
Similarly, you find the same logic in some recent schemes for composing access policies for security systems. When policies are combined, you may have a situation where no policy specifically permits or denies a certain action (i.e. No Information), or two policies provide conflicting permissions (i.e. Conflict).
You can also use the same logic values to describe the various states of eventually-consistent systems. In fact, technologies like vector clocks exist to preserve the extra value in these logics.
In database systems, you find something called the Closed-World Assumption. Namely, that everything in the database is True and everything else is False. Interestingly, this is unworkable enough that databases include the NULL value just to represent No Information, although the semantics are muddy enough that the use of False and NULL can be a contentious point during system design. The Open-World Assumption (which is most popular among the Semantic Web folks) allows for much better handling of incomplete, old, or conflicting values.
In sum, there’s something pervasive going on. If you feel that the universe is perpetuating a conspiracy against you, I can confirm it. There’s something to figure out here. Now, on to the problem at hand.
Information About Failures
When operating on a single computer, even with multiple cores, there are certain things you don’t have to worry about. For example, if a core fails, usually the whole computer blows up. As such, you don’t really need to worry about something blowing up, as you won’t be there to see it. On the other hand, if another program that you are working with disappears, you can detect it with absolute certainty. Put concretely, on a single system it is possible to determine if something has failed gracefully, and when things explode badly you don’t have to worry about it. Expressed in logical terms:
alive?(something_local) → Definitely True | Definitely False
When you’re concerned with something in a distributed system, you don’t have this luxury. In the best case, you get:
alive?(something_distributed) → Definitely True | No Information
This is the core of the problem. Most distributed algorithms spend their time dealing with this. They do this by using tools like causal relationships, monotonic clocks, and real time. Each approach comes at the problem from a different angle.
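A Python sketch of this asymmetry (the liveness check is illustrative; a TCP connect stands in for whatever health probe a real system would use, and the host/port are placeholders):

```python
import socket

# A reply proves the peer *was* alive a moment ago. A failure or timeout
# proves nothing: it could be a crash, a partition, or just a slow link.

def alive(host: str, port: int, timeout: float = 1.0) -> str:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "definitely true"   # we heard from it: it was alive
    except OSError:
        return "no information"        # crash? partition? slowness? can't tell
```

A timeout here tells you nothing definite about the peer; real systems layer heartbeats and suspicion timeouts (failure detectors) on top of this primitive, but they never escape the underlying asymmetry.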
Networks, Partitions, and Time
When discussing distributed programs, you often hear things expressed in terms of crashing processes and partitioned networks.
You might ask why I use the term “partitioned network”. This is a very precise way of implying the difference I noted above with respect to the information we get. In a network, there’s no way to tell the difference between a machine failure and a communications failure. More importantly for the purposes of understanding distributed computing, defining the problem in terms of a partitioned network gives the people working with theorems a tractable way to represent all of the possible states of the system (as you can only partition a network in so many ways).
Similarly, speaking about crashed processes lets us explicitly define the other side of the missing-communications coin. While we don’t usually have to deal with crashed processes on a single machine (as the whole machine dies), now a lack of a response could be a crash (rather than a partition).
You also don’t get any guarantees about how late a communication can come. Even the simplest useful state machine (one with two states) can be in an infinite number of states when a network and retry-behavior are involved. Notably, partitions and crashes are also unrelated to this phenomenon.
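A toy Python sketch of how retries blow up the state space, and how tagging messages with an id (a hypothetical deduplication scheme, one of several possible) makes late or duplicated deliveries harmless:

```python
# A two-state toggle looks trivial, but a retried (duplicated) message flips
# it twice. Tracking the ids of messages already applied restores sanity.

class Toggle:
    def __init__(self):
        self.on = False
        self.seen = set()          # ids of messages already applied

    def receive(self, msg_id: int):
        if msg_id in self.seen:    # duplicate from a retry: drop it
            return
        self.seen.add(msg_id)
        self.on = not self.on

t = Toggle()
for msg_id in (1, 1, 2):           # message 1 arrives twice due to a retry
    t.receive(msg_id)
print(t.on)                        # False: 1 applied once, duplicate dropped, 2 applied
```

Without the `seen` set, the duplicate delivery would leave the machine in a state that depends on network behavior rather than on what was sent.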
While these formalisms aren’t that useful to us, just remember that the point is that lack of a message gives little concrete information by itself. These terms let us concretely quantify exactly the types of missing information that we are dealing with.
You Can’t Have It All
This brings us to the CAP Theorem, which is itself a specific instance of The Consensus Problem. Any time you have a distributed group of actors that have to agree on some piece of information, it rears its ugly head.
Solutions to Distributed Consensus all have mathematically certain boundaries in just what you can achieve. You can thank Fischer, Lynch, and Paterson for the eponymous FLP impossibility result. It lets us know that it’s impossible (in a very strong sense) for any algorithm to solve the consensus problem in bounded time. It’s not to say that consensus can’t be achieved, but you can’t get a guarantee, your mileage may vary, and there’s no such thing as a free lunch.
The biggest thing to come out of these solutions is the CAP Theorem, which states that no distributed system can provide more than two of the following properties:
- Consistency: The data being managed by some system appears the same everywhere at a given point in time.
- Availability: The system can make changes to the data.
- Partitioning: The system can tolerate a partition (i.e. a split) between its components.
So, iterating the permutations:
- Data can be consistent and available, but the system can’t tolerate a network split.
- Data can be available and the system can tolerate a network split, but it may not look the same in multiple places at the same time.
- Data can be consistent and the system can tolerate a network split, but the system won’t be available during the split.
There was a time when people actually sold #1. Typically it was sold with very expensive networking gear to guarantee that partitions did not occur. The current craze in eventual consistency embraces #2. Systems in category #3 exist, but it turns out that few people realize their significance or the work required to use them. Notably, special handling is required to take advantage of either of the latter two options. Despite this, most people don’t really understand that handling.
Batteries Not Included
This is the ironic bit. Most people who haven’t scaled very much don’t realize that there is work to be done when recovering distributed transactions across relational (usually SQL) databases. Eventually-consistent types usually know about the recovery they need to do, but rarely implement anything before their first conflict, which may never arrive if they don’t scale very much.
In both cases, you utilize causality to solve the problem, but it’s interesting to note that most people have never actually had to do it. This has led to a number of common misconceptions about exactly what “comes for free” and what “requires work”.
When done well, an architect will attempt to design things such that you know which actor or prior event caused a certain event, and use this to resolve or prevent conflicts. Put another way, since the only information you can get is Definitely True, you rely on this fact to constructively create a history rather than trying to fill in the gaps with assumed values.
In eventually-consistent systems, this special handling takes the form of Conflict Resolution. Specifically, you get a complete history of who has handled a value. Then, depending on your application, you can handle the conflict appropriately. For example, if the data is a “shopping cart”, you might just merge the two conflicting views of the cart. If the data is a user-password, you might just pick the one that was updated last. If it was a temperature reading on a server, you might just arbitrarily pick one (since it’s going to get updated in a bit anyways).
In this way, the eventually consistent system can resolve Conflict states. That said, it gives no tools for dealing with No Information states. Specifically, you can’t ever tell if a value is current. A newer value may exist somewhere else, and you can only resolve it in post (usually by allowing it to grow from No Information into a Conflict).
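The shopping-cart and password examples above might be sketched like this in Python (the data shapes are invented for illustration; a real system would get the conflicting siblings from its storage layer):

```python
# Application-specific conflict resolution: the same machinery yields a
# different merge depending on what the data means.

def resolve_cart(a: set, b: set) -> set:
    return a | b                   # shopping carts: union of both views

def resolve_password(a, b):
    # (timestamp, value) pairs: keep the one updated last
    return max(a, b)[1]

cart = resolve_cart({"book"}, {"book", "mug"})
pw = resolve_password((100, "hunter2"), (200, "correct-horse"))
print(sorted(cart), pw)            # ['book', 'mug'] correct-horse
```

Note that each strategy loses something different: the union can resurrect a deleted item, and last-updated-wins silently discards one write. The application, not the store, decides which loss is acceptable.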
Vector clocks give you information about the “causality ordering” of mutations to the shared data. They keep track of who last modified a piece of data and a counter that helps sequence how the modifications relate. As such, you don’t ration out causality (not rationing it is a must for availability). Rather, like a politician, you institute a bureaucracy to document any mistakes and wait to clean them up until somebody notices.
This recovery strategy trades the certainty of a single world view for availability in the face of network outages (and does so using application-specific knowledge).
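Here is a minimal vector-clock sketch in Python showing that bookkeeping (a toy version; real implementations also prune and serialize these maps):

```python
# Each node bumps its own slot on a write and merges clocks on receive.
# Comparing clocks tells us whether one update happened before another,
# or whether the two are concurrent (i.e. a genuine conflict).

def bump(clock: dict, node: str) -> dict:
    c = dict(clock)
    c[node] = c.get(node, 0) + 1
    return c

def merge(a: dict, b: dict) -> dict:
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def happened_before(a: dict, b: dict) -> bool:
    return all(a.get(n, 0) <= b.get(n, 0) for n in a) and a != b

v1 = bump({}, "alice")                 # {'alice': 1}
v2 = bump(v1, "bob")                   # {'alice': 1, 'bob': 1}
v3 = bump(v1, "carol")                 # concurrent with v2
print(happened_before(v1, v2))         # True
print(happened_before(v2, v3) or happened_before(v3, v2))  # False: concurrent
```

When neither clock dominates the other, the system has detected a Conflict state, and the application-specific resolution described above takes over.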
In most relational databases, you have transactions. Transactions require some handling by a concerned client to ensure that they aren’t repeated and are definitely committed. If you actually care about this (i.e. you are a bank, Wall Street trading firm, or operate at extremely high scale) you do that “two-phase commit” you may have heard about. While most database programmers are familiar with BEGIN and COMMIT, very few know about PREPARE. It’s also unhelpful that many databases have “prepared query” functionality that has nothing to do with this.
With a carefully handled transaction, you BEGIN, do some work, PREPARE, then COMMIT PREPARED. This prepare phase gets a guarantee from the server that a COMMIT PREPARED will succeed. This lets you then COMMIT PREPARED to multiple databases and get transactional semantics across multiple databases.
While this is all well and good, very few people have ever actually had to deal with what happens the case of a distributed failure. Ostensibly, you are supposed to PREPARE the transaction for all of the databases first, then you COMMIT PREPARED them in a second round. If a failure occurs before they are all successfully prepared, you can roll back the ones that were prepared. Otherwise you can commit them.
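The prepare-all-then-commit sequence can be sketched in Python (the `Participant` class is a stand-in for a database connection; the names and failure behavior are invented for illustration):

```python
# Phase 1: every participant must promise (PREPARE) before anyone commits.
# Phase 2: once all have promised, COMMIT PREPARED everywhere.
# Any prepare failure rolls back the promises already collected.

class Participant:
    def __init__(self, name, can_prepare=True):
        self.name, self.can_prepare = name, can_prepare
        self.state = "idle"

    def prepare(self):
        if not self.can_prepare:
            raise RuntimeError(f"{self.name} cannot prepare")
        self.state = "prepared"       # durable promise that commit will succeed

    def commit_prepared(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    prepared = []
    try:
        for p in participants:        # phase 1: all must promise
            p.prepare()
            prepared.append(p)
    except RuntimeError:
        for p in prepared:            # any failure: roll back the promises
            p.rollback()
        return False
    for p in participants:            # phase 2: promises are kept
        p.commit_prepared()
    return True

dbs = [Participant("orders"), Participant("billing")]
print(two_phase_commit(dbs), [p.state for p in dbs])  # True ['committed', 'committed']
```

The fragile spot is exactly the one described next: a coordinator that dies between the two phases leaves participants stuck in the "prepared" state, holding their promises (and often their locks) indefinitely.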
Unfortunately, this creates a sort of half-committed limbo that can impact availability. Sometimes other transactions can’t be committed until these transactions have (depending on the serialization level, database implementation, and phase of the moon). These subsequent transactions either fail repeatedly (i.e. aborted when you try to prepare/commit them) or hang waiting on the commit.
In this way, you preserve enough information to prevent events from happening out of order by rationing out events through this process. This rationing causes pauses when there is No Information, and allows you to prevent ever reaching a Conflict state.
This recovery strategy trades the availability of being able to take an update without full communication for the ability to recover completely failed systems in a way that always shows the same world view. The two phases of this commit give you control of the “causality ordering” of mutations to the data by allowing you to guarantee that something either happens or it doesn’t.
AAA versus D
These two approaches have a lot more in common than it first appears. First of all, they both can tolerate a network partition. Secondly, they both preserve information about the order that things happen to resolve conflicts. Thirdly, they both require work from the user to guarantee what consistency they provide. Most subtly, neither of them involves any information about real-time (they are asynchronous in distributed parlance).
That last one is a biggie. Notice that they don’t care when something happens, only the order in which it happens. When you start caring about some time value (often called “wall time”, as in the clock on the wall), things get complicated in ways that are difficult to analyze.
The ways that they differ are also telling. Eventual consistency lets things get mucked up and out of order, but lets you resolve it. Strong Consistency preserves what’s needed to keep things in order, possibly at the expense of other transactions that are running. In both cases, they care about causality very explicitly. Again, causality is the one and only tool you have to keep things consistent asynchronously.
So, what’s so bad about introducing a “wall clock”? Well, it’s just fraught with peril, as systems can have strongly nondeterministic behavior if they act on wall clock timing when there is significant clock skew. For example, let’s say that you try to resolve conflicts by picking the one that happened “later”. If one system’s clock is “slow”, then another system may always consider its updates so old that it doesn’t use them. Similarly, if these clock values are persisted, running a system with an impossibly fast clock (i.e. it thinks it’s the year 3000) will yield values that are always “newer” than legitimate updates.
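A toy Python sketch of that skew hazard: with last-write-wins resolution, a node whose wall clock runs slow loses an update it actually made later in real time (the timestamps are fabricated to make the skew explicit):

```python
# Last-write-wins resolution: pick the update with the "later" wall-clock
# timestamp. This silently discards the newer write when clocks are skewed.

def last_write_wins(a, b):
    return max(a, b)                  # (wall_clock, value) pairs

# Node B writes *after* node A in real time, but B's clock is 50s slow.
update_a = (1000, "old value")        # A's clock reads t=1000
update_b = (950, "new value")         # B wrote later, but its clock says t=950

winner = last_write_wins(update_a, update_b)
print(winner[1])                      # old value -- the newer write is lost
```

Nothing in the data flags the mistake: the resolution is deterministic, quiet, and wrong, which is exactly why clock-based conflict resolution is hard to analyze.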
Even if you permit a certain value for skew, these systems aren’t usable when the legitimate delay gets too large. For example, if you were to FTP a file to Mars, using the default recommended timeouts for FTP and TCP, you would only be able to complete a connection for about two-thirds of the time (and worse cases would take almost five hours before the transfer would even really start). This isn’t even considering the fun that attackers can have playing with clocks.
In general, determinism buys you something. Specifically, if both systems are deterministic, then you can predict what the other guy is going to do. Wall clocks lose this property, so you end up requiring some communication to keep things in sync, which is often the very thing the wall clock is being used to prevent.
You may have noticed that I keep saying that causality is the only tool you have to work with. There’s a good reason. You can’t count on much about the information you receive, but one thing you can count on is that something can’t be known about until it exists. If you hear about some value, it must have existed at some point before you heard of it. This means that by building systems where one piece of information refers to the information that “caused” it, you lock this ordering into your system.
Vector clocks do this by keeping a list of previous versions. Two-Phase Commit creates two phases, one depending on the other, to guarantee that the system doesn’t get changed unless it happens everywhere. In both cases, the implication that one thing causes the next is what preserves consistency (or at least enables it to be recovered).
So, the problem of writing distributed programs is not just concurrency or latency, but rather the fact that you have less information to work with. You can’t ever have global information about which nodes have failed or the state that they might be in. You can enforce global state through strict causality or resolve a global state using conflict resolution. With care, your system can deal with all four values of the logic, but different implementations have different strengths and weaknesses.
Getting back to our original premise, how does this compare to our experience writing local, threaded programs? Here are the chief differences:
- at best, an eventually-consistent distributed read/write is similar to a volatile local read/write, except that you keep enough information to try to resolve conflicts, and you have no way to guarantee the recency of reads
- at best, a strictly-consistent distributed read/write is similar to a locked local read/write, with all of the potential delays that a lock may have amplified by the network latency
- crashes usually can’t be distinguished from communications failures
- failures (and thus crashes) don’t occur in such a way that you can assume that all actors die at once
- solutions that involve real time introduce the nondeterminism of networking into your application’s behavior, often requiring more communication due to the lack of predictable behavior
- causality is about the only tool that can provide any guarantees when distributed
As you can see, most of these problems aren’t strictly new, but they certainly are made worse in the distributed environment. How does this fit together? The easiest way is to analyze the different types of failures we can encounter:
- crashes: undetectable globally
- communications failures: indistinguishable from crashes
- clock skew: confuses real-time-based systems, potentially corrupting data
- clock failure: can cause real-time systems to fail to take corrective action
This is pretty different from what most people are used to. None of these issues are as difficult (or in some cases even exist) on a single machine.
If you ever have trouble with distributed code, just look around for these assumptions. When a decision is made from a lack of communication it’s a red flag. When one value arbitrarily overwrites another, it’s also a red flag.
Just remember, information is rarely created, hard to preserve, and easily destroyed. Find out where your information is leaking and you’ll find the hole in your implementation. Just cross your fingers and hope it’s not too deep in the design.
Concurrency is not the problem. Distributed code comes with exciting new issues in addition to concurrency. There are solutions to these problems, though each comes with its own baggage.
I certainly don’t mean to scare people away from distributed systems. But I would caution that proving rigorously that a system with an infinite number of states is safe in all of them probably requires at least a little formal analysis.
If you’re interested in more reading, I’d recommend contrasting Lamport’s Paxos with Amazon’s Dynamo. Pretty much everything you need to know about distributed consensus comes up between the two. Most importantly, have a lot of fun!