My Writings. My Thoughts.

Sublime Text Diagram Plugin 2.1

// July 11th, 2012

For all of you Sublime Text lovers out there, here’s an updated version of my SublimeText Diagram Plugin.  Highlights of this release include:

  • Updated PlantUML version.
  • Better installation instructions.
  • Initial support for Linux.
  • Much better error messages.
  • Works correctly when run with optimization mode enabled.
  • Command registered in the Command Palette.

The release is over on GitHub: https://github.com/jvantuyl/sublime_diagram_plugin/tree/2.1

MICROFACT: How To Create An Empty Git Root Commit

// July 11th, 2012

I’ve been doing a lot of tooling and implementation with the wonderful Git revision control system.  In my time, I’ve needed to do some magic things involving repository history.  One thing that comes up often is the need to have an empty commit history on some branch.
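The full recipe is behind the link below, but one common way to get an empty root commit (not necessarily the exact approach from the post; the branch name and message here are placeholders) looks like this:

git checkout --orphan empty          # a new branch whose first commit will have no parent
git rm -rf .                         # clear the files that were staged from the old branch
git commit --allow-empty -m "empty root commit"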

Continue Reading

Diagramming Plug-In for Sublime Text 2

// September 12th, 2011

For those of you who have tried it, the minimalistic but powerful experience that is Sublime Text 2 is fairly nice.

One of the things that is missing (for me at least) is integration for my diagramming tools. I make pretty extensive use of PlantUML, GraphViz, and DITAA for diagrams. I have a strong belief that, while pictures are worth 1000 words, it is not always worth the trouble to draw them by hand. While Visio has sold many concepts, I like to be able to embed diagrams pervasively in my code, track their development with revision control, and ultimately be able to modify them with some powerful (invariably text-based) tool.

PlantUML has a lovely Eclipse plugin. It makes a nice little view that sits neatly next to your editor. When your selection tracks into a diagram, commented or otherwise, it dutifully renders it for you.

Sublime Text has not yet reached the point where this level of graphical integration is possible. However, you can have my diagrams when you pry them from my cold, dead hand. As a stopgap measure, I’ve written a little plugin for Sublime Text that will find diagrams that overlap your selection, render them, and then display them in a handy Quick Look menu. As you might have guessed, this is currently Mac only (though I’ll gladly accept patches). While it only supports PlantUML right now, it’s capable of being easily extended for other diagram types.

You can find it on GitHub. Or, if you prefer to just have the git URL: git://github.com/jvantuyl/sublime_diagram_plugin.git

I hope it’s helpful. To install it, you must have Java installed on your Mac. Then just check it out into ~/Library/Application Support/Sublime Text 2/Packages/Diagram (i.e. that should be the root of the checkout). Upon loading Sublime, just check your console to see that it detected the version of PlantUML as it should have. If so, you should be able to select any drawings (whether undecorated or in a comment) and press Cmd+M to render them (this shortcut is up for discussion, I just picked one that was open on my system).

I’ve only tested it on Lion, but I think it should work on Leopard and Snow Leopard as well (all it needs from MacOS is the qlmanage command).

I hope it’s helpful!

ErlCtl 0.3 Released

// December 4th, 2010

This is mostly an incremental update, but it does fix a few big issues:

  • Added a verbose mode for debugging use with -v.
  • Fixed a situation where the client hostname was computed but didn't match what the local networking would choose (thus breaking communication in certain environments).
  • Catch more errors in CLI code.

I'd really appreciate it if people could check whether this breaks DNS usage on any of their hosts.  It seems solid to me, but I only have so many places to test it out.

You can find it in the usual place.

Thanks!

Are Contributor Agreements Detrimental To Open Source?

// August 31st, 2010

What Prompted This?

I have recently been reading a lot of Simon Phipps’ blog posts. They are very good and I highly recommend adding him to your RSS reader if you care about technical issues, especially those revolving around how Open Source interacts with business.

He just wrote an excellent article entitled Should Open Source Communities Avoid Contributor Agreements?. I don’t particularly agree with him, and I thought I’d dissent over here, largely because I want to see the software that I use continue to be Open Source.

Why Do You Write Open Source?

Having participated in (and actually initiated some) projects that require copyright assignment, it’s really a non-issue. At least, usually it is. The question is largely based around why you write Open Source software.

There are various camps, but you can generally summarize the views into these prototypical categories:

  • Code-Lovers: I like for people to use my software.
  • Collaborationists: I think that trading code makes software more valuable.
  • Tinkerers: I like to take apart things and Open Source is a social contract to enable that (i.e. The Right To Tinker).
  • “First Sale Doctrine” Proponents: If I buy it, it should be mine to break as I please. Open Source ensures this.
  • Copyleft Fundamentalists: Reverse Engineering is a civil right. Open Source is an economic instrument to force you to bend to my will.

Interestingly, all but the last group tend to be ambivalent about contributor agreements. In particular, only the fundamentalists are typically against Contributor Agreements on any grounds other than economics (i.e. do they increase or decrease the number of projects open sourced).

Those Nuts From Berkeley

A complete picture of Open Source history should include the BSD License people. Many people still won’t license under anything else precisely because of the reasons that people seek copyright assignment.

Have their contributions been worth it? Yes.

Have their communities been significantly less functional because of it? Not really.

How do they feel about contributor agreements? In general, they either will sign one or just release their patches anyways. They’ll never try to coerce you not to sign one because, as far as they’re concerned, what’s theirs belongs to them and what’s yours belongs to you.

Am I A GPL-Hating Tea Partier?

I love the GPL above all licenses. That said, I’ve released two projects under the LGPL with Copyright Assignment. Consequently, I helped funnel literally tens of millions of dollars into Open Source projects precisely because some Venture Capitalists would allow me to do so if they could get their payday. This has been and continues to be a good thing for Open Source. Let them be our Medicis; we can continue our paintings.

Rather than railing about community inequality, I would propose that people examine perhaps the biggest benefit of assignment. Without a single legal copyright owner, it is very difficult for any company to drive a project. You have no idea how many people won’t buy a product from you because you are not the neck to wring.

Is Open Source about business? Well, the answer to that largely appears to depend on how hungry a person’s family is. Lack of food has an amazing chilling effect on Open Source development.

To the extent that Open Source is about making a living, the business equation looms large. While employment certainly might appear to be a “glass ceiling” for contributors, others see it as being a job opportunity. It’s certainly beneficial to those who start the project to be able to get a job doing it.

It’s also beneficial to the users to be able to employ those people. It’s beneficial to the other users that they don’t have to. Pretty much everybody wins unless the founder’s benefactor chooses to force them to do “bad things”. Ironically, the founder could have done that on their own, so this doesn’t change the odds much. Conversely, that pool of other users gives the founder a lot of employment opportunities, effectively preventing much of this coercion.

To the extent that Open Source is about keeping the source open for the users, copyright assignment doesn’t do that much. At most, there’s a fear that someone will “close up” a project. Somehow I don’t think it’ll kill Java, Solaris, or any of the others. I’d be more worried about getting a perpetual patent license than assigning copyright.

Meritocracy

I find it ironic that Mr. Phipps suggests that Open Source projects are supposed to be a meritocracy run by people who have earned it. The belief that they are participating in a meritocracy might be the way that some of my Asperger’s-afflicted colleagues make it through the night without crying themselves to sleep. I doubt that you’re going to find a political scientist that would call most Open Source projects a meritocracy.

Even in the most open projects, meritocracy often is shaped more like commit-bit-cracy. Regardless of how we may want it to be, some collective of people will be the ones that control the SCM, the website, et cetera. A business is about the best way to organize and drive that group. Anything less gets even more political in a way that inevitably costs users and developers. Consider how much innovation you’ve seen in GNU libc lately, for example.

Or Debian itself? Where would it be without Canonical’s Ubuntu? Oh, and Canonical has a contributor agreement, by the way. And while GNU doesn’t have a contributor agreement as such, it asks you to assign copyright to the FSF (which, ironically, doesn’t have the money to enforce all of those copyrights).

Those projects that are meritocracies are largely that way precisely because of the control exercised by the group who would ask for that contributor agreement. The Mozilla project that he points out is exactly that.

Meritocracies are rare because it’s not merit that creates them—it’s control. Believing you can build a meritocracy without some control structure is akin to believing your girlfriend will stay with you just because you’re a good person. It takes effort, a plan, and care to make a relationship, development team, or software community functional.

But Is It Fair?

The million dollar question is “Why does starting a project give you more credit?” I’ve been on the down-side of this particular transaction. So has anyone that tried to improve XFree86 or OpenSolaris or any other “half-open” project. I didn’t enjoy those experiences but there wasn’t a moment that a Contributor Agreement would have stopped me.

Fear of these agreements doesn’t stop users; it stops coders. Even if it does stop some coders from being users, it certainly doesn’t stop a majority of them. Being a user is a prerequisite for having a community, in general, and for being a contributor, in particular. That makes this a lot like worrying about your cheap speakers when you have a cheap amplifier, or worrying about performance tires when you’re driving a Kia.

But What If They Close It Up?

A company typically gets to do this exactly once. Once the trust is gone, it’s gone. On the other hand, we get everything they contributed along the way. Presumably that had to be something useful for it to be worth contributing to in the first place. Are people contributing to SCO very much these days?

I could point out any number of examples, but I think this one is pretty clear. Is it a gamble? Sure. Has it gotten us access to some pretty good software? Yes.

Do you really think that OpenJDK wasn’t worth it because of the Sun Community Contributor Agreement? Has it being Open Sourced been better for everyone? Has the agreement materially kept the software from getting better? Would cutting all of that off have been worth it just because we knew that Oracle was going to pull what it’s pulling?

In the end, any entity with copyright assignment has a vested interest in guaranteeing that a fork doesn’t exist with improvements that they can’t use—as it makes that copyright worth less. As such, the fear of a fork gives the creators more incentive to play ball, not less. I don’t think I’ve ever seen a project that had a healthy fork not have a healthy community. Either it’s healthy because the company cares, or it’s healthy because the users forked it (which tends to make people cohesive).

The Moral Of The Story

Mr. Phipps’ point seems to be that contributor agreements are an obstacle to people adopting and continuing to develop projects as a community. I propose that getting people to start projects as Open Source is more important than a little boost to projects that are already there. Especially a boost that decreases the incentive to fork, thereby making Open Source organizations less stable.

Just like dropping bombs on people tends to make them terrorists, I’d suggest that removing the payday from starting projects makes closed sourcerors. I don’t think that’s worth it, at all.

Instead, we should focus on getting sane patent licensing, for example.

Distributed Dragons: Why Distributed Algorithms Are Different

// August 29th, 2010

The Perfect Question

Anyone who has ever tried to teach rapidly learns that the chief obstacle to learning is not failing to understand answers, but failing to ask the right questions. The best students usually have an intuition for how questions dovetail together, a creativity in formulating them, and a general talent for finding the right way to distinguish subtle differences. Teachers often love these students, though their classmates might not appreciate them so much.

As a professional developer, I often find myself in the role of teacher. Not to suggest that my colleagues are incompetent. They are quite far from it. It’s just that nobody has knowledge of everything, and a singular few have the time to gain experience with certain technologies. While I love the process of working together with people to create something, I occasionally find myself quite appreciative of working with people who can ask just the right questions.

I recently found myself expositing on some of the more esoteric vagaries of distributed computing. I was pressed with a question so incisive that it served to distinguish a piece of knowledge that I understand so implicitly that I rarely think to point it out. The question was this:

Why is writing code and handling data in a distributed environment any different from writing threaded code? It’s just concurrency, right?

It turns out that it’s anything but that simple. However, it’s a really good way to start understanding it. Most programmers worth their salt are familiar with threading and the concurrency primitives used with threads. Unfortunately, this knowledge is often informal; the formalisms and rules we follow when working with threaded concurrency are not usually understood deeply.

As such, this question was a really good way to bring out the assumptions underlying concurrent programming in a single system; to illustrate why those assumptions don’t carry over to the distributed setting; and to provide context to illustrate the value (and limitations) of distributed consensus.

Quick Survey of Threading Assumptions and Primitives

The first programs that most people write are single-threaded. To wit, when run, the computer starts at some entry point, progresses through a sequence of instructions in some order, and may eventually terminate. As we eventually learned, the operating system (henceforth, OS) is itself a program that passes control to our program. For those of us that started in an older time, the OS often gave up control of the machine entirely (DOS, I’m looking at you). This made it clear that there was some sort of interface between the OS and our program.

In more modern times, the OS appeared to run simultaneously with our program. Knowing that our computers only had one processor led us to learn that the OS had mechanisms to temporarily switch between itself and different programs (and thus multitasking was born). Eventually the computer really did get multiple processors (and later multiple cores in a single processor). That’s the level of sophistication at which we find ourselves. Each level has stripped away a set of assumptions.

  • Single-Threaded Program: No two things happen at once, everything proceeds in an order that we control.
  • Multitasking OS: The OS may do things behind our back, multiple copies of our program may be running, and the program may occasionally be scheduled away. Things like file-locks start showing up.
  • Multi-Threaded Program on a Single Core: Different Threads may be scheduled off at certain times, so we may use in-memory locks, though certain types of contention aren’t seen as only one thread is really executing at once. Locks implemented in the OS (in software) may show up.
  • Multi-Threaded Program on Multiple Cores: Now we can really muck things up, as two bits of code on different cores can simultaneously battle over data. Locks must be implemented in hardware (i.e. a multi-core consistent atomic-exchange or test-and-set instruction of some sort), generally with some interface through the OS. The increase in accessible concurrency leads to a focus on things like locking regimes and deadlock detection.

Inverting the above list, there are various assumptions about concurrency that we shed. Interestingly, most people assume that the issue of distributed computing has to do with concurrency with the addition of greater latency. It turns out that there is no more concurrency in the network model and latency is mostly a performance issue (not a data corruption issue).

Understanding the above is also hindered by two more factors. Firstly, many practitioners today never used a non-multitasking OS. Secondly, most people’s learning about concurrency issues comes in the form of ad-hoc “hack on it until it works” experimentation. The combination of the two usually results in a lot more superstition than information.

Without experiencing a deeply single-threaded environment, you don’t have the contrast to appreciate where parallelism is really happening. Without formal background, you don’t really have any hard guarantees when things do work. Together, these create a situation where programmers work less on hard data and more on perceptions of how risky certain things are.

The way they distinguish and weigh the risks of concurrent code is imprecise and often out of balance. You know that guy who calls every mysterious bug a “race condition”? Yeah, that’s the one. These people are not stupid, but learning from experience is not always possible without the right tools.

A Detour Into Logic

Before we dig more into issues unique to distributed computing, there is an incredibly useful concept to know. As anybody familiar with digital encoding will ramble about at length, Information Theory, Computing, and Thermodynamics are tied together in a pretty serious way. Specifically, information is the opposite of entropy, and the transfer of information behaves in a slippery way that is determined by deep mathematics. It’s difficult to express exactly what they’re getting at without a lot of math. However, we don’t need to prove it, or even really understand it. We just need tools to work with it.

As it turns out, there is a confluence between philosophy, logic, and circuit design. Specifically, Nuel Belnap developed something that has been called Relevance Logic (if you can find it, read “A useful four valued logic.”). The upshot of relevance logic is that you can do useful things with a system in which a logical proposition isn’t viewed as simply being True or False. Rather a value can have one of the four states Definitely True, Definitely False, Conflicting Information, or No Information.

If you squint, these concepts are suspiciously close to values used when designing circuits from boolean logic using Karnaugh Maps. In those maps, you have blank spots that may have any value (i.e. ’don’t know’), and you have blank spots that don’t matter (i.e. ’don’t care’). The former must be solved in order to have a valid circuit (i.e. a buffer value can be provided to prevent “race conditions” where a value flips state in a way that is harmful), whereas the latter are of no consequence to coming up with a solution (and thus don’t matter in the design).

Similarly, you find the same logic in some recent schemes for composing access policies for security systems. When policies are combined, you may have a situation where no policy specifically permits or denies a certain action (i.e. No Information), or two policies provide conflicting permissions (i.e. Conflict).
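To make these four values concrete, here is a toy sketch (my own illustration in Erlang, with assumed atom names) of how two observations of the same proposition combine under this logic:

%% Merge two observations of the same proposition.
%% The atoms true, false, no_info, and conflict are names assumed for illustration.
combine(no_info, X) -> X;       % anything beats "no information"
combine(X, no_info) -> X;
combine(X, X) -> X;             % agreement is preserved
combine(_, _) -> conflict.      % anything else (including an existing conflict) is "conflicting information"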

You can also use the same logic values to describe the various states of eventually-consistent systems. In fact, technologies like vector clocks exist to preserve the extra value in these logics.

In database systems, you find something called the Closed-World Assumption. Namely, that everything in the database is True and everything else is False. Interestingly, this is unworkable enough that databases include the NULL value just to represent No Information, although the semantics are muddy enough that the use of False and NULL can be a contentious point during system design. The Open-World Assumption (which is most popular among the Semantic Web folks) allows for much better handling of incomplete, old, or conflicting values.

In sum, there’s something pervasive going on. If you feel that the universe is perpetuating a conspiracy against you, I can confirm it. There’s something to figure out here. Now, on to the problem at hand.

Information About Failures

When operating on a single computer, even with multiple cores, there are certain things you don’t have to worry about. For example, if a core fails, usually the whole computer blows up. As such, you don’t really need to worry about something blowing up, as you won’t be there to see it. On the other hand, if another program that you are working with disappears, you can detect it with absolute certainty. Put concretely, on a single system it is possible to determine if something has failed gracefully, and when things explode badly you don’t have to worry about it. Expressed in logical terms:

alive?(something_local) → Definitely True | Definitely False

When you’re concerned with something in a distributed system, you don’t have this luxury. In the best case, you get:

alive?(something_distributed) → Definitely True | No Information

This is the core of the problem. Most distributed algorithms spend their time dealing with this. They do this by using tools like causal relationships, monotonic clocks, and real time. Each approach comes at the problem from a different angle.

Networks, Partitions, and Time

When discussing distributed programs, you often hear things expressed in terms of crashing processes and partitioned networks.

You might ask why I use the term “partitioned network”. This is a very precise way of implying the difference I noted above with respect to the information we get. In a network, there’s no way to tell the difference between a machine failure and a communications failure. More importantly for the purposes of understanding distributed computing, defining the problem in terms of a partitioned network gives the people working with theorems a tractable way to represent all of the possible states of the system (as you can only partition a network in so many ways).

Similarly, speaking about crashed processes lets us explicitly define the other side of the missing-communications coin. While we don’t usually have to deal with crashed processes on a single machine (as the whole machine dies), now a lack of a response could be a crash (rather than a partition).

You also don’t get any guarantees about how late a communication can come. Even the simplest useful state machine (one with two states) can be in an infinite number of states when a network and retry-behavior are involved. Notably, partitions and crashes are also unrelated to this phenomenon.

While these formalisms aren’t that useful to us, just remember that the point is that lack of a message gives little concrete information by itself. These terms let us concretely quantify exactly the types of missing information that we are dealing with.

You Can’t Have It All

This brings us to the CAP Theorem, which is itself a specific instance of The Consensus Problem. Any time you have a distributed group of actors that have to agree on some piece of information, it rears its ugly head.

Solutions to Distributed Consensus all have mathematically certain boundaries in just what you can achieve. You can thank Fischer, Lynch, and Paterson for the eponymous FLP impossibility result. It lets us know that it’s impossible (in a very strong sense) for any algorithm to solve the consensus problem in bounded time. It’s not to say that consensus can’t be achieved, but you can’t get a guarantee, your mileage may vary, and there’s no such thing as a free lunch.

The biggest thing to come out of these solutions is the CAP Theorem, which states that no distributed system can provide more than two of the following properties:

  • Consistency: The data being managed by some system appears the same everywhere at a given point in time.
  • Availability: The system can make changes to the data.
  • Partition Tolerance: The system can tolerate a partition (i.e. a split) between its components.

So, iterating the permutations:

  1. Data can be consistent and available, but the system can’t tolerate a network split.
  2. Data can be available and the system can tolerate a network split, but it may not look the same in multiple places at the same time.
  3. Data can be consistent and the system can tolerate a network split, but the system won’t be available during the split.

There was a time when people actually sold #1. Typically it was sold with very expensive networking gear to guarantee that partitions did not occur. The current craze in eventual consistency embraces #2. Systems embodying #3 exist as well, but it turns out that few people realize their significance or the work required to use them. Notably, special handling is required to take advantage of either of the latter two options. Despite this, most people don’t really understand them.

Batteries Not Included

This is the ironic bit. Most people that haven’t scaled very much don’t realize that there is work to be done when recovering distributed transactions across relational (usually SQL) databases. Eventually-Consistent types usually know about the recovery they need to do, but rarely implement anything before their first conflict, which is often never if they don’t scale very much.

In both cases, you utilize causality to solve the problem, but it’s interesting to note that most people have never actually had to do it. This has led to a number of common misconceptions about exactly what “comes for free” and what “requires work”.

When done well, an architect will attempt to design things such that you know which actor or prior event caused a certain event, and use this to resolve or prevent conflicts. Put another way, since the only information you can get is Definitely True, you rely on this fact to constructively create a history rather than trying to fill in the gaps with assumed values.

Eventually Consistent

In eventually-consistent systems, this special handling takes the form of Conflict Resolution. Specifically, you get a complete history of who has handled a value. Then, depending on your application, you can handle the conflict appropriately. For example, if the data is a “shopping cart”, you might just merge the two conflicting views of the cart. If the data is a user-password, you might just pick the one that was updated last. If it was a temperature reading on a server, you might just arbitrarily pick one (since it’s going to get updated in a bit anyways).
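As a concrete sketch of what such application-level resolution might look like, here is a toy Erlang resolver; the data shapes and names are my own assumptions, not any particular store’s API:

%% Resolve the sibling values an eventually-consistent store hands back.
%% The right strategy depends entirely on what the data means to the application.
resolve(shopping_cart, Siblings) ->
  %% merge: keep the union of every item seen in any sibling cart
  lists:usort(lists:append(Siblings));
resolve(user_password, Siblings) ->
  %% last-write-wins: each sibling is assumed to be {UpdatedAt, Password}
  {_When, Password} = lists:max(Siblings),
  Password;
resolve(temperature, [Any | _]) ->
  %% arbitrary pick: a fresh reading will replace it shortly anyway
  Any.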

In this way, the eventually consistent system can resolve Conflict states. That said, it gives no tools for dealing with No Information states. Specifically, you can’t ever tell if a value is current. A newer value may exist somewhere else, and you can only resolve it in post (usually by allowing it to grow from No Information into a Conflict).

Vector clocks give you information about the “causality ordering” of mutations to the shared data. They keep track of who last modified a piece of data and a counter that helps sequence how they relate. As such, you don’t ration out causality (which is a must for availability). Rather, like a politician, you institute a bureaucracy to document any mistakes and wait to clean things up until somebody notices.
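The mechanism itself is small enough to sketch (my own toy version, storing a map of actor to counter; real implementations carry more bookkeeping):

%% A clock is a map of Actor => Counter.
new_clock() -> #{}.

%% An actor bumps its own entry every time it modifies the data.
increment(Actor, Clock) ->
  maps:update_with(Actor, fun(N) -> N + 1 end, 1, Clock).

%% descends(A, B): the history behind A includes everything behind B.
descends(A, B) ->
  maps:fold(fun(Actor, N, Acc) -> Acc andalso maps:get(Actor, A, 0) >= N end, true, B).

%% Two values are conflicting siblings when neither clock descends the other.
concurrent(A, B) ->
  (not descends(A, B)) andalso (not descends(B, A)).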

This recovery strategy trades the certainty of a single world view for availability in the face of network outages (and does so using application-specific knowledge).

Strongly Consistent

In most relational databases, you have transactions. Transactions require some handling by a concerned client to ensure that they aren’t repeated and are definitely committed. If you actually care about this (i.e. you are a bank, Wall Street trading firm, or operate at extremely high scale) you do that “two-phase commit” you may have heard about. While most database programmers are familiar with BEGIN and COMMIT, very few know about PREPARE. It’s also unhelpful that many databases have “prepared query” functionality that has nothing to do with this.

With a carefully handled transaction, you BEGIN, do some work, PREPARE, then COMMIT PREPARED. This prepare phase gets a guarantee from the server that a COMMIT PREPARED will succeed. This lets you then COMMIT PREPARED to multiple databases and get transactional semantics across multiple databases.

While this is all well and good, very few people have ever actually had to deal with what happens in the case of a distributed failure. Ostensibly, you are supposed to PREPARE the transaction for all of the databases first, then you COMMIT PREPARED them in a second round. If a failure occurs before they are all successfully prepared, you can roll back the ones that were prepared. Otherwise you can commit them.
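The coordinator’s side of that dance can be sketched as follows (my own simplification; Prepare, Commit, and Rollback stand in for whatever driver calls issue the prepare, commit-prepared, and rollback statements against a single database):

%% Phase one: ask every database to prepare. Phase two: commit them all,
%% or roll back the ones that did prepare if anyone refused.
two_phase_commit(Databases, Prepare, Commit, Rollback) ->
  case prepare_all(Databases, Prepare, []) of
    {ok, Prepared} ->
      lists:foreach(Commit, Prepared),   % every participant promised the commit will succeed
      committed;
    {failed, Prepared} ->
      lists:foreach(Rollback, Prepared), % undo the participants that were already prepared
      aborted
  end.

prepare_all([], _Prepare, Acc) ->
  {ok, Acc};
prepare_all([Db | Rest], Prepare, Acc) ->
  case Prepare(Db) of
    ok -> prepare_all(Rest, Prepare, [Db | Acc]);
    _  -> {failed, Acc}
  end.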

Unfortunately, this creates a sort of half-committed limbo that can impact availability. Sometimes other transactions can’t be committed until these transactions have (depending on the isolation level, database implementation, and phase of the moon). These subsequent transactions either fail repeatedly (i.e. aborted when you try to prepare/commit them) or hang waiting on the commit.

In this way, you preserve enough information to prevent events from happening out of order by rationing out events through this process. This rationing causes pauses when there is No Information, and allows you to prevent ever reaching a Conflict state.

This recovery strategy trades the availability of being able to take an update without full communication for the ability to recover completely failed systems in a way that always shows the same world view. The two phases of this commit give you control of the “causality ordering” of mutations to the data by allowing you to guarantee that something either happens or it doesn’t.

AAA versus D

These two approaches have a lot more in common than it first appears. First of all, they both can tolerate a network partition. Secondly, they both preserve information about the order that things happen in order to resolve conflicts. Thirdly, they both require work from the user to guarantee what consistency they provide. Most subtly, neither of them involves any information about real time (they are asynchronous in distributed parlance).

That last one is a biggie. Notice that they don’t care when something happens, only the order in which it happens. When you start caring about some time value (often called “wall time”, as in the clock on the wall), things get complicated in ways that are difficult to analyze.

The ways that they differ are also telling. Eventual consistency lets things get mucked up and out of order, but lets you resolve it. Strong Consistency preserves what’s needed to keep things in order, possibly at the expense of other transactions that are running. In both cases, they care about causality very explicitly. Again, causality is the one and only tool you have to keep things consistent asynchronously.

Real Time

So, what’s so bad about introducing a “wall clock”? Well, it’s just fraught with peril, as systems can have strongly nondeterministic behavior if they act on wall clock timing when there is significant clock skew. For example, let’s say that you try to resolve conflicts by picking the one that happened “later”. If one system’s clock is “slow”, then another system may always consider its updates so old that it doesn’t use them. Similarly, if these clock values are persisted, running a system with an impossibly fast clock (i.e. it thinks it’s the year 3000) will yield values that are always “newer” than legitimate updates.
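To see how little it takes, consider a toy last-write-wins resolver (my own illustration); one node with an absurdly fast clock wins every merge from then on:

%% Pick whichever {Timestamp, Value} pair carries the larger wall-clock timestamp.
lww({T1, V1}, {T2, _}) when T1 >= T2 -> {T1, V1};
lww(_, {T2, V2}) -> {T2, V2}.

%% A node that thinks it is roughly the year 3000 writes once; every legitimate,
%% newer update from well-behaved nodes loses to it forever afterwards:
%% lww({32503680000, bad}, {1283040000, good}) -> {32503680000, bad}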

Even if you permit a certain value for skew, these systems aren’t usable when the legitimate delay gets too large. For example, if you were to FTP a file to Mars, using the default recommended timeouts for FTP and TCP, you would only be able to complete a connection for about two-thirds of the time (and worse cases would take almost five hours before the transfer would even really start). This isn’t even considering the fun that attackers can have playing with clocks.

In general, determinism buys you something. Specifically, if both systems are deterministic, then you can predict what the other guy is going to do. Wall clocks lose this property, so you end up requiring some communication to keep things in sync—which is often the very thing the wall clock is being used to prevent.

Why Causality?

You may have noticed that I keep saying that causality is the only tool that you have to work with. There’s a good reason. You can’t count on many things about information you receive, but one thing you can count on is that something can’t be known about until it exists. If you hear about some value, it must have existed at some point before you make that claim. This means that if you build systems where one piece of information refers to the information that “caused” it, you lock this information into your system.

Vector clocks do this by keeping a list of previous versions. Two-Phase Commit creates two phases, one depending on the other, to guarantee that the system doesn’t get changed unless it happens everywhere. In both cases, the implication that one thing causes the next is what preserves consistency (or at least enables it to be recovered).

The Dragons

So, the problem of writing distributed programs is not just concurrency or latency, but rather the fact that you have less information to work with. You can’t ever have global information about which nodes have failed or the state that they might be in. You can enforce global state through strict causality or resolve a global state using conflict resolution. With care, your system can deal with all four values of the logic, but different implementations have different strengths and weaknesses.

Getting back to our original premise, how does this compare to our experience writing local, threaded programs? Here are the chief differences:

  • at best, an eventually-consistent distributed read/write is similar to a volatile local read/write, except that you keep enough information to try to resolve any conflicts but have no way to guarantee recency in reads
  • at best, a strictly-consistent distributed read/write is similar to a locked local read/write, with all of the potential delays that a lock may have amplified by the network latency
  • crashes can’t usually be distinguished from communications failures
  • failures (and thus crashes) don’t occur in such a way that you can assume that all actors die at once
  • solutions that involve real time introduce the nondeterminism of networking into your application’s behavior, often requiring more communication due to the lack of predictable behavior
  • causality is about the only tool that can provide any guarantees when distributed

As you can see, most of these problems aren’t strictly new, but they certainly are made worse in the distributed environment. How does this fit together? The easiest way is to analyze the different types of failures we can encounter:

  • crashes: undetectable globally
  • communications failures: indistinguishable from crashes
  • clock skew: confuses real-time-based systems, potentially corrupting data
  • clock failure: can cause real-time systems to fail to take corrective action

This is pretty different from what most people are used to. None of these issues are as difficult (or in some cases even exist) on a single machine.

If you ever have trouble with distributed code, just look around for these assumptions. When a decision is made from a lack of communication it’s a red flag. When one value arbitrarily overwrites another, it’s also a red flag.

Just remember, information is rarely created, hard to preserve, and easily destroyed. Find out where your information is leaking and you’ll find the hole in your implementation. Just cross your fingers and hope it’s not too deep in the design.

Closing

Concurrency is not the problem. Distributed code comes with new and exciting issues in addition to concurrency. There are solutions to these problems, though each comes with its own baggage.

I certainly don’t mean to scare people away from distributed systems. But I would caution that proving rigorously that a system with an infinite number of states is safe in all of them probably requires at least a little formal analysis.

If you’re interested in more reading, I’d recommend contrasting Lamport’s Paxos with Amazon’s Dynamo. Pretty much everything you need to know about distributed consensus comes up between the two. Most importantly, have a lot of fun!

ErlCtl 0.2 Released

// May 31st, 2010

ErlCtl 0.2 is up on GitHub.  Not a huge number of changes, but some definite improvements.  In no particular order:

  • When a CLI module isn't available, gives a reasonable error message.
  • Typos fixed.
  • README and Docs updated.
  • Now works when EPMD is not already running.  (Oops!)
  • Smarter errors when name service is grossly misconfigured (though not perfect yet).

For anyone who's using it, enjoy!

Announcing ErlCtl, 0.1

// February 21st, 2010

There has always been a need to have a good command-line interface to most system software. While Erlang is great for writing this kind of stuff, its natural command line can be less than stellar. If we were left to use the erl command itself, command lines would be long to type and arduous to construct.

Most projects avoid this problem with the standard Apache idiom of having a control script.

As it so happens, Erlang is a lovely language for this idiom. Rather than requiring lots of trouble managing processes and working with sockets, Distributed Erlang neatly solves the issues of simple security and finding running programs, and generally makes it easy for two Erlang programs to interact.

While the major Erlang projects have their own implementations of this (ejabberdctl, rabbitmqctl, riak-admin), there isn't really anything built-in to fill this need.

After a little frustration trying to adapt other code to fit the bill, I just wrote my own for use with my latest super-sekret project. After a little cleanup, here's a generic library for building simple command-line interfaces and an example application that uses it! World, meet ErlCtl. It's young, so, please, be gentle.

How Do I Use It?

How might you use this software, you ask?  Well, it's really quite simple.  After installing erlctl-0.1 into your Erlang lib_dir (not automated yet), follow these two easy steps.

First, somewhere in your command path, make a symlink from the erlctl binary (in erlctl-0.1/bin) to the name of your control program.  For this example, let's say you have an application named awesome.  In this case, you'd link the program to awesomectl (or, alternately awesome_ctl).

Second, in your application, create a module based on the same name.  In this case, you would create awesome_cli.erl.  This module should export functions (which I call "commands") that take two arguments.  The first argument is the "context" within which that command runs, and the second is the list of arguments for that function.

Commands For Any Season

For example, let's say you have a command that should print out the version of your application.  If you define that "version" command with the "always" context, then it will simply run your code.  For example:

version(always,[]) -> {ok,"Version: ~p",[0.1]}.

This simple function will return a successful return code and print out the version number.  Presto, instant CLI!  You can run this with a command line like "awesomectl version".  It's that easy!

Starting Things Up

Let's say that you need to start a server process.  If you provide a command in the "not_running" context, it can start a server as easily as it can return the atom "start".

start(not_running,[]) -> {start,[],"Starting."};

It doesn't have to stop there.  Need to run some code after starting the server?  Create the same command with the "started" context, and it will be run in the started server.

start(started,[]) ->
  ok = application:start(ctl_ex),
  {ok,"Started!"}.

Notice that the first command decided to start a new server, the server started, and the second command took over inside of the newly started server?  Neat, huh!  You would run this with the simple command line of "awesomectl start".

It's Like An "Erlang Endoscope"™

Let's say your server is happily running along, but you need to monkey about with its internals.  The "running" context will cause the command to happily run inside of the server with no messy work required!

list_users(running,[]) ->
  {ok,Users} = ctl_ex:list_users(),
  format("Users:~n"),
  case Users of
    [] ->
      io:format(" <no users>~n");
    _ ->
      lists:foreach(
        fun (User) ->
          format(" ~s~n",[User])
        end,
        Users
      )
  end,
  ok.

Notice how there are a few useful commands to help out?  The format command transparently handles printing out stuff at the client side, without worrying about where your code is actually running.  That's location-transparency that saves you time!  As you might have imagined, the command here is "awesomectl list_users".

What's even better is that commands may implement any combination of these contexts as they see fit.  So you can give intelligent errors when a server isn't running, or implement actual functionality when it is!

A Fully ARMED and OPERATIONAL Example Application

Is this a bit too much to take in at once?  Why not take a look at the example application I wrote for you?  It's a simple application, with a simple gen_server that keeps track of a list of users.  A full CLI is provided for it in this single file.

How Erlang Makes Event-Driven Code Easy

// January 15th, 2010

I don’t think it’s a secret that event-driven programming is pretty useful. One of the reasons that I’ve found Erlang so attractive is that it allows you to write code that behaves like evented code, but still feels like imperative code.

To put this in perspective, take a look at Twisted for Python. In order to make a function that waits for something else to occur, you have to split your code into callbacks and then attach them to a deferred object.

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.python.util import println

def display(page):
  println(page)
  reactor.stop()

getPage("http://www.google.com/").addCallbacks(display)

reactor.run()

In a full-blown Twisted application, you usually are just implementing the callbacks and you pass them to parts of the system that fill in the blanks. It’s quite powerful, but it’s kind of a pain because you end up breaking up your code into these little pieces. The control flow of your program is completely sliced and diced.

Similarly, Event Machine for Ruby has an identical problem.

require 'rubygems'
require 'eventmachine'
require 'pp'

EventMachine::run do
  conn = EM::Protocols::HttpClient2.connect "www.google.com",80
  request = conn.get '/'

  request.callback do |r|
    pp r.content
    EM.stop
  end
end

As you can see, any chain of significant operations ends up being broken into a series of callbacks. While this is quite functional, it rapidly devolves into a mess that makes it very unclear how the code flows. This is probably the number one issue that drove me to Erlang.

So, what does this look like in Erlang?

#!/usr/bin/env escript
main(_) ->
  inets:start(),
  {ok,R} = http:request("http://www.google.com"),
  io:format("~p~n",[R]).

This is interesting to me because it appears to be normal imperative code. Behind the scenes, the calling “process” sends a message to the evented httpc_manager (which handles the request), goes to sleep, and eventually receives a response message. In practice, this is very fast and effectively allows you to write code that is shaped normally. However, the chunks of executed code and passed messages exactly map to how a reactor handles deferred function calls.
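The shape hiding behind that blocking-looking call is roughly this (a sketch of my own, not the actual inets internals):

%% The caller tags a request, sends it to the evented manager process, and
%% selectively receives the matching reply. While it waits, the VM simply
%% schedules other processes, exactly as a reactor would.
request(Manager, Url) ->
  Ref = make_ref(),
  Manager ! {request, self(), Ref, Url},
  receive
    {response, Ref, Result} -> Result
  end.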

In C, Java, Python, or Ruby, the main thread would have been blocked. In Erlang, the “process” politely stops and the VM finds another process to run (just like EventMachine or Twisted’s reactors do). Interestingly, Erlang isn’t the only language to get this. You can do something similar in Scala or Go. I find it to be quite powerful and expect to see it a lot more in the future.

Of course, like so many things in Erlang, this is just the beginning. Using the OTP supervision hierarchy makes it easy to tie together systems of this sort of code reliably. It also neatly reduces the load required for safely handling errors, since the supervision tree can neatly replace hundreds of error-handling callbacks (not pictured above). Anyone who has ever had to debug a large event-driven mess will appreciate where I’m coming from. But that’s another post…

Finding Dependencies for Erlang Code

// December 17th, 2009

There was a short but lively discussion on the Erlang Questions list today about Makefiles and building Erlang projects.  It included a link to an interesting document on Recursive Makefiles.  In that document, I noticed an interesting tidbit that I’ve never thought about.

When building big Makefiles, you want accurate (but automatic) dependency information.  To make this easier, some C compilers (notably gcc) have a "-M" option.  Given this option, the compiler will create a file (usually with .d extension) containing Makefile code linking each file to its dependencies.

This is handy, since it would be a pain to parse C and C++ files for included files.  Getting the include paths correct would be difficult as well.  Running through all of the files once with -M (and generating appropriate dependencies for the .d files as well) creates a system whereby make can maintain its own dependency information.

I find this neat.  In fact, since this system is very nearly as old as I am, I find it extra neat.

Sadly, the Erlang compiler doesn’t have a facility to automatically generate dependencies.  That said, the Erlang compiler is written in Erlang, and most of its internals are exposed.  It seemed, then, that it would not be too much trouble to use those internals to generate the dependency information I wanted.

Ignoring a particularly sadistic list comprehension that I had to decipher, it wasn’t too much trouble to put the pieces together.  Thus, I present you with a rough escript that will generate dependencies for a list of Erlang source files.  Despite Erlang’s verbose nature, it clocks in at 50 lines.  Not too shabby!

You can download it at:  http://needlesslymessianic.com/files/erl_deps
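To give a flavor of the approach, here is a much-simplified sketch of my own (not the actual erl_deps script): epp hands you the preprocessed forms, and every header it pulled in shows up as a file attribute.

#!/usr/bin/env escript
%% Usage: ./deps_sketch foo.erl [include_dir ...]
main([Source | IncludeDirs]) ->
  {ok, Forms} = epp:parse_file(Source, IncludeDirs, []),
  Deps = lists:usort([File || {attribute, _, file, {File, _}} <- Forms,
                              File =/= Source]),
  io:format("~s: ~s~n", [Source, string:join(Deps, " ")]).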

I don’t know about you, but I think this is pretty cool.  I think it’s most cool because you have all of the flexibility you get having a self-hosted language (like C, PyPy, and Rubinius; unlike CPython or Matz’s Ruby), but all of the introspection that you get in a dynamic language.  Being able to use the same functions that the compiler uses is very powerful.

Anyways, enjoy.  Comments welcome.