Monday, 4 November 2013

Bulimic Clients: Applications that needed Sync

This post is part of a series on the rationale behind my current project, Cortex Sync Platform.

If I were to make a list of words that should be considered technological profanity, "sync" and "merge" would be damn near the top, alongside "vista", "visual basic", and "arrays starting at 1".

The first memories that "sync" conjures up would be terribly sloppy experiences of Palm Pilot contact lists and null modem cables. In more recent years it's come to mean taking two legacy SQL tables generated in different places, and trying to piece together the rows using imperfect and unindexed column criteria, while you know that you're probably duplicating items because someone spelt their email wrong one time, and losing other data because there are two people named "John Smith" in the system.

"Merge" brings up even more frustrating thoughts of trying to commit your change when an architect has liberally refactored all of the code that your new feature was written on top of.

So "sync" and "merge" have become dirty words as far as development goes, whether it's about the code you're writing, the records in your database, or the state of your distributed application. I've seen far too many applications try to wiggle their way out of needing to use these dirty words, but I think that the world needs to start cussing more often when it comes to app development.
(Merging is actually a component of syncing; I point it out specifically because it's the dirtiest part.)

Defining Sync

Less poetically, syncing is often used as an abstract term rather than a discrete function, so what is it exactly?

Sync is the hard-core solution to diverging state in a distributed system. I want to talk about mobile application development rather than databases or CDNs, so for the context of this post I'll pose the "diverging state" problem like this:
  • Multiple phones are running the same app
  • The app makes "changes" to its own state (these could be messages sent in an IM, actions in a game, etc.)
  • Each phone can make as many changes as it likes, and take as long as it likes before sending them to the server (for connection reasons or otherwise).
  • Once a change is sent to the server, it should be sent to all the other phones immediately, and each of those phones should immediately reflect the new change. The change may need to undergo some modification on the server and on the receiving clients.
  • You can assume that every change made locally will move the local app from one valid state to another valid state.
  • Every change received from the server (which follows the above rule) must also move the local app from one valid state to another valid state.
  • All phones which have pushed all of their changes to the server, and have received all changes from the server, should have converged to identical app states.
Note that I'm deliberately not talking about distributed servers. Paxos is a reasonable catch-all when you've got reasonably fast connections between nodes with reasonable uptime, and having on the order of n^2 persistent connections for n nodes across your network is acceptable. It falls apart for mobile device use cases, where you want 0 or 1 persistent connections, and have terrible connection uptime relative to how often changes are made on those nodes.

The tough part of solving diverging state/sync is ensuring that you keep the applications in a valid state. You need to make sure that you do this without throwing away a bunch of data that your users would like to keep, and without replicating a bunch of data that your users wouldn't like to read multiple times. You also need to make sure that the way things get merged makes sense in the context of your app. If your head isn't spinning from this, then you haven't written and maintained a distributed application.

For this post, I'm calling all of the data that might need to be synced across clients the "application model". "Changes" are any kind of description of an operation to modify that model.

Let's consider the two reasonable modern approaches to working around this problem:

The Modern Easy Solution: Thin Clients

In many cases, we can get around even trying to implement sync by taking logical authority away from the client. That means very low-level input is sent from the client to the server, and the server evaluates that input in the order received (considering multiple clients). Each time the server does one of these evaluations, it sends a full copy of any parts of the application model that might have been affected by that particular evaluation to every client that the server knows is still alive. The key here is that the application model is never changed directly by the client, not even the client's local copy.
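
To make that concrete, here is a minimal Python sketch of the thin-client loop. It is only an illustration under my own assumptions: the names ThinServer and ThinClient are made up, the "network" is a direct method call, and the application model is just a list of chat messages.

```python
class ThinServer:
    """Hypothetical server holding the only authoritative model."""

    def __init__(self):
        self.model = []      # assumed model: a simple chat transcript
        self.clients = []    # clients the server believes are alive

    def handle_input(self, raw_input):
        # Inputs are evaluated strictly in the order the server receives them.
        self.model.append(raw_input)
        # Send a full copy of the affected model to every live client.
        for client in self.clients:
            client.receive(list(self.model))


class ThinClient:
    """Hypothetical client that never edits its own copy of the model."""

    def __init__(self, server):
        self.server = server
        self.view = []       # read-only copy, used for rendering only

    def on_user_input(self, raw_input):
        # Forward the low-level input; do not touch self.view here.
        self.server.handle_input(raw_input)

    def receive(self, model_copy):
        # The local copy is replaced wholesale by the server's version.
        self.view = model_copy


server = ThinServer()
alice, bob = ThinClient(server), ThinClient(server)
server.clients.extend([alice, bob])
alice.on_user_input("hi")
bob.on_user_input("hello")
print(alice.view, bob.view)
```

Both views only ever change inside receive(), which is why nothing renders until the round trip to the server completes.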

Thin clients are nothing new, and are often the correct choice when the tradeoff of KISS vs performance leans heavily towards KISS. Their drawback is that resolving even the simplest operation requires a round trip to the server, and an interruption in connectivity means that the app totally stops working beyond the primitive operations of the client. This works fine for content that is primarily read-only (think news website) or for interactive apps where delays on the order of a few seconds are acceptable (think settings, comments pages), but falters when the results of your own changes must be near instantaneous or when the changes of other clients must be rendered in real time (instant messaging, games, collaborative editing).

The Modern Hard Solution: Bulimic Clients

You might have thought that I was going to say "Fat Client", but you'd be wrong, because making a fat client requires sync, and no one wants to implement sync. "Bulimic Client" is a term that I've been giving to systems which try to mostly keep the thin client model, but squeeze some more performance out of it using some clever engineering. This is all in the hope of not needing to implement true sync.

Since this isn't a formally defined model, implementations may vary, but the typical intended process of a bulimic client looks something like this:
  1. Get the most recent data model from the server.
  2. Get a local event (e.g. client typing) which should change that model.
  3. Modify the local copy of the model directly, and render.
  4. Send the changes in the form of thin-client-like changes, but keep accepting input from the client.
  5. When the server responds with an updated model, overwrite the parts of the model modified in step 3 with the most recent, server-authoritative data. Render again.
If you haven't picked up the analogy yet, the bulimic client maintains the appearance of a thin client to the server, while it is actually consuming local changes like a fat client (modifying the local model directly). The Bulimic Client maintains its thin appearance to the server by purging the changes it consumed in a fat manner, and putting the perfect data from the server's model in its place.
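
The five steps above can be sketched in a few lines of Python. This is a rough illustration rather than any real framework: BulimicClient, outbox, and the dict-shaped model are all names and shapes I'm assuming for the example, and the server's reply is passed in by hand instead of arriving over a network.

```python
class BulimicClient:
    """Hypothetical client: fat locally, thin as far as the server can tell."""

    def __init__(self, server_model):
        # Step 1: start from the most recent server model.
        self.model = dict(server_model)
        self.outbox = []   # thin-client-style changes waiting to be sent

    def on_local_event(self, key, value):
        # Steps 2-3: consume the change locally so it can render right away.
        self.model[key] = value
        # Step 4: queue the same change for the server; input is never blocked.
        self.outbox.append((key, value))

    def on_server_response(self, server_model):
        # Step 5: purge the optimistic values and take the server's data as-is.
        self.model.update(server_model)


client = BulimicClient({"title": "Untitled"})
client.on_local_event("title", "  Draft ")
optimistic = client.model["title"]             # what the user sees immediately
client.on_server_response({"title": "Draft"})  # assumed: the server trims whitespace
print(repr(optimistic), "->", repr(client.model["title"]))
```

In this happy path there is only ever one change in flight, so the overwrite in step 5 is harmless.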

At first glance, bulimic clients seem to provide the best of both worlds from thin and fat clients. Some benefits include:
  • Your server can be the same one used for a thin client.
  • You don't have to implement real sync.
  • You can render the effects of local changes immediately.
  • Your local copy of the model will always be updated to the most recent server copy, eventually.
And so, as an industry we've adopted the Bulimic Client pattern, and used it all over the place.

Why Bulimic Clients Should be Considered an Anti-Pattern

Naturally, I'm only going to call something "bulimic" if it's a remarkably dangerous and bug-prone method. So my outline of a bulimic client above actually left out one implicit, albeit important, step. Here's the real workflow:
  1. Get the most recent data model from the server.
  2. Get a local event (e.g. client typing) which should change that model.
  3. Modify the local copy of the model directly, and render.
  4. Send the changes in the form of thin-client-like changes, but keep accepting input from the client.
  5. When the server responds with an updated model, overwrite the parts of the model modified in step 3 with the most recent, server-authoritative data. Render again.
  6. Pray to God that you didn't overwrite anything important in step 5.
Since your bulimic client keeps accepting user input while waiting for a response from the server, the server response to one set of input could arrive while the request carrying another set of input is still in flight. If you apply the bulimic model directly, step 5 could overwrite the changes made for the second set of input, and so they'll disappear until the response to the second set of input arrives. Now you've got application logic that is even worse than the thin client, because as far as your users are concerned, things are just flashing on and off arbitrarily.
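
Here is a small Python simulation of that race, as a sketch under simplifying assumptions: the model is a plain list of messages, the two requests and responses are played out by hand in the problematic order, and simulate_flicker is a name invented for this example.

```python
def simulate_flicker():
    server_model = []   # authoritative message list
    local_model = []    # the client's optimistic copy
    frames = []         # what the user sees after each render

    def render():
        frames.append(list(local_model))

    # The user sends "A": applied locally, request 1 goes in flight.
    local_model.append("A")
    render()
    # The user sends "B" before response 1 arrives: request 2 goes in flight.
    local_model.append("B")
    render()

    # The server handles request 1 and replies with its model so far.
    server_model.append("A")
    response_1 = list(server_model)
    # The server handles request 2 and replies again.
    server_model.append("B")
    response_2 = list(server_model)

    # Step 5 for response 1: overwrite local state, which wipes out "B".
    local_model[:] = response_1
    render()
    # Step 5 for response 2: "B" reappears.
    local_model[:] = response_2
    render()
    return frames


frames = simulate_flicker()
print(frames)  # [['A'], ['A', 'B'], ['A'], ['A', 'B']]
```

The third frame is the bug: message "B" was on screen, vanished, and then came back, even though neither the user nor the server ever asked for it to be removed.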

So what almost invariably happens at this point is you admit to yourself that avoiding sync altogether can't be done, but perhaps you only need to write logic to make this one part of the application sync properly.

So you hack together some business logic that sets all kinds of flags and priorities for some small part of your app which dictates when server updates can overwrite local updates, and which parts can overwrite what, and when to keep both, and when to discard both. And then you find that occasionally this logic can leave your application model in an invalid state, so you write some more logic to reject client or server changes under some conditions. And then... And then... And then...

And then you can't read this awful mess of quasi-temporal ordering spaghetti code, which has a bug that only occurs 1 in every 1000 runs on Tuesdays in India.

What Am I Looking For?

I enjoy hacking around on applications and ideas which are supposed to feel device-independent. Meaning that I can put down my phone while using the app, then pick up where I left off from my tablet or laptop immediately. But invariably, I find myself falling into some kind of derivative of a thin or bulimic client, and the experience is lost.

If I manage to salvage the user experience, it's only because I've sacrificed the readability of the code, or increased the complexity of the network communication by a few orders of magnitude. And that's not to mention how fragile the user experience becomes once it depends on the quality of the network.

What I want is a fat-client that I don't need to re-write for the model of every distributed application that I make. What I want is a fat-client sync platform.

I want a platform that lets me build my application's model inside of it. And the platform figures out how to do a full fledged fat-client sync of that model between a user's device, my server, and all other user devices that need to see that model.

I want a platform that tells me when new changes are synced and available to read, no matter if they were made by the local device, the server, or otherwise.

And finally, I want a platform that ensures if my clients only make changes that move the model from one valid state to another valid state, the synced model will also stay in a valid state.

That's a steep problem, and that's been the goal of my side project for the past 11 months.

More to come on that work soon.
