Easy as 1, 2, 3, right? Lessons from Our Incrementing Service
This post will show that counting with computers isn’t as easy as The Count from Sesame Street makes it out to be. Incrementing (i.e. 1, 2, 3, and so on) gets complicated in distributed systems. Why would you want incrementing like this? At Shippo, we work with external parties and receive allocated ranges of identifiers from carriers like FedEx, UPS, or USPS. We must stay within each range and avoid vending the same identifier twice.
As part of refactoring our Shippo codebase, we extracted this incrementing logic into a microservice. We’ll walk through the journey that led us to our best approach, and what we can all learn from it.
How hard can this be?
If you approach this as a function, all you need is a long-lived variable that increments a counter, so no value repeats across subsequent calls. However, if the process dies, restarting it will vend duplicate identifiers. And with multiple hosts in play, you’ll repeat the same values unless there is some central coordination, like a database. We’ll explore that and the other options we considered, highlighting what we learned along the way.
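The naive single-process version can be sketched in a few lines; the function name here is just for illustration. It works perfectly until the process restarts or a second host runs the same code:

```python
import itertools

# Naive approach: a long-lived, in-process counter. Each call vends the
# next identifier, and values never repeat, as long as this one process
# stays alive and no other host is vending from the same range.
_counter = itertools.count(start=1)

def next_identifier():
    return next(_counter)
```

A restart resets `_counter` to 1, and a second process starts its own copy at 1, so both failure modes described above vend duplicates.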
Option 1: Keep It Simple Stupid
By far the smartest thing to do is make the logic intentionally dumb. If that seems like an oxymoron, let’s define terms: by “smart” we mean easiest to maintain, and by “dumb” we mean requiring the least amount of code, no central storage, and no coordination across boxes either, all with no discernible latency penalty. How can you have it that good? The answer lies first in acknowledging that we live in a world of probabilities. There is some probability that a computer disk crashes and, zooming out, that an entire data center goes down. If the probability of something undesirable happening is non-zero but less likely than two simultaneous data center outages, then you can call it negligible enough to be effectively zero.
Mathematically, if your identifier space is big enough, then you can make the probability of ever getting a duplicate identifier, even after centuries, be infinitesimally small.
Unfortunately, the ranges external parties granted us were not large enough to make this practical. It actually doesn’t take much to make things work: GUIDs/UUIDs have 128 bits, and the SHA-1 space that Git uses heavily has 160 bits; either would work, as discussed in the birthday paradox Wikipedia article. But we were only in the range of 25–50 bits: not enough, sadly. Legacy constraints like barcodes make this space even tighter. You can mitigate by encoding time as part of the identifier, so that request velocity rather than lifetime growth becomes the limiting factor, but the tradeoff of fewer “core” bits didn’t help our use case.
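The birthday-paradox math can be sketched with the standard approximation p ≈ 1 − e^(−n(n−1)/2N) for n random draws from a space of size N. The bit widths below come from the discussion above; the identifier counts are hypothetical volumes chosen to show the contrast:

```python
import math

def collision_probability(n_ids, bits):
    """Birthday-bound approximation: probability that at least two of
    n_ids identifiers drawn uniformly from a 2**bits space collide."""
    space = 2 ** bits
    return 1 - math.exp(-n_ids * (n_ids - 1) / (2 * space))

# A 128-bit UUID space: even after a billion identifiers, the collision
# probability is so small it rounds to zero in floating point.
p_uuid = collision_probability(10**9, 128)

# A 40-bit space (mid-range of the 25-50 bits we had): risky after
# only a million identifiers.
p_small = collision_probability(10**6, 40)
```

This is why random generation was fine for UUID-sized spaces but off the table for our carrier-allocated ranges.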
Option 2: Direct Incrementing in SQL or NoSQL
ACID SQL engines mean multiple hosts hitting the same DB will not double vend. And not just for a single sequence: we needed several, for different external parties and use cases. In DB-speak, all we needed was row-level locking, not table-level locking, and that comes out of the box when executing UPDATE queries. The problem is that a similar approach already existed before we moved to a microservice, and we knew it could become a serious bottleneck. Load tests against both our older code and a prototype microservice confirmed things could scale to a few dozen requests per second, but not to the low hundreds of requests per second we needed over the horizon. So the door closed on SQL UPDATEs too.
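The UPDATE-based pattern looks something like this. In production this would run against PostgreSQL, where the UPDATE takes a row-level lock so concurrent callers serialize on that one row; the in-memory sqlite3 database and the table/column names here are stand-ins for illustration:

```python
import sqlite3

# Stand-in schema: one row per allocated counter range.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('fedex_range_a', 0)")

def next_identifier(name):
    # In an ACID engine the UPDATE locks just this row, so two callers can
    # never vend the same value, but they do queue up one behind the other.
    with conn:  # one transaction per increment
        conn.execute(
            "UPDATE counters SET value = value + 1 WHERE name = ?", (name,))
        (value,) = conn.execute(
            "SELECT value FROM counters WHERE name = ?", (name,)).fetchone()
    return value
```

That per-row queueing is exactly the hot spot that capped throughput at a few dozen requests per second in our load tests.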
With SQL out of the picture, was NoSQL an option for us? NoSQL’s typical focus on eventual rather than strong consistency might’ve been a showstopper, but it turns out AWS offers a DynamoDB feature called Atomic Counters. DynamoDB allows optionally consistent writes, and Atomic Counters wrap that. However, given that SQL UPDATEs were already a demonstrated bottleneck, DynamoDB’s lack of a long-lived connection pool or a closer network topology made us realize it wouldn’t work either. DynamoDB’s advertised Atomic Counters still meant one-at-a-time writes, which didn’t add up unless we had something lightning fast at our disposal.
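For reference, a DynamoDB Atomic Counter increment is a one-call sketch like the following, assuming a boto3 `Table` resource; the key schema and attribute names are hypothetical:

```python
# Sketch of a DynamoDB Atomic Counter, assuming a boto3 Table resource
# obtained via boto3.resource("dynamodb").Table(...). The "ADD" action
# increments server-side, avoiding a read-modify-write race.
def next_identifier(table, name):
    response = table.update_item(
        Key={"counter_name": name},
        UpdateExpression="ADD #v :inc",
        ExpressionAttributeNames={"#v": "current_value"},
        ExpressionAttributeValues={":inc": 1},
        ReturnValues="UPDATED_NEW",
    )
    return response["Attributes"]["current_value"]
```

Atomicity wasn’t the issue here; the per-request network round trip to DynamoDB was, given the throughput we needed.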
Option 3: Spreading out the bottleneck
Both SQL and NoSQL had a variation that could dramatically mitigate the bottleneck: what if that row-level locking “hot spot” were not concentrated on one row but spread out behind the scenes across N different rows? Concretely, if you had a range of 1 trillion and you wanted 10 groups, you could assign one of them 0–100 billion, another 100–200 billion, and so on up to 900 billion–1 trillion. You have external constraints on identifier size, but internally you can be more creative with the implementation. Clients don’t pass in a group number; the service simply picks one round-robin or at random. Doesn’t that just kick the problem down the road? Yes, but if it’s not by months but by decades, then you likely have something viable. This was a solution that could have been successful. We were readying to pursue it when someone asked, “What about DB sequences?”
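The range-splitting arithmetic is simple enough to sketch; the 1-trillion range and 10 groups mirror the example above, and each group would be backed by its own ordinary counter row:

```python
import random

RANGE_SIZE = 1_000_000_000_000  # externally allocated range
NUM_GROUPS = 10                 # internal shards, invisible to clients

def group_bounds(group):
    """Half-open [start, end) slice of the overall range for one group."""
    width = RANGE_SIZE // NUM_GROUPS
    return group * width, (group + 1) * width

def pick_group():
    # Clients never choose a group; the service picks randomly (or
    # round-robin) so writes spread evenly over the N counter rows.
    return random.randrange(NUM_GROUPS)
```

With 10 groups, each counter row sees roughly a tenth of the write load, which pushes the single-row bottleneck out by an order of magnitude.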
Option 4: Our Lightning-Fast Heroes
First off: what are DB sequences? They are tiny tables provided by the SQL engine vendor, fine-tuned for incrementing, and you set them up with CREATE SEQUENCE instead of CREATE TABLE. So people can rightfully ask “What about DB sequences?” because they are optimized for lightning-fast incrementing, much faster than SQL UPDATEs. It was easy to create another prototype and see how it performed under a load test. The answer was astonishing: SQL UPDATEs couldn’t even make it halfway to a triple-digit request rate, but the PostgreSQL sequence approach successfully processed quadruple digits per second.
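The hot path of such a prototype can be sketched as below, assuming a DB-API cursor (e.g. from psycopg2) and a hypothetical sequence created with `CREATE SEQUENCE fedex_range_a;`:

```python
# Sketch of vending from a PostgreSQL sequence via a DB-API cursor.
def next_identifier(cursor, sequence_name):
    # nextval() is atomic: it never hands the same value to two callers,
    # even across connections, with no explicit locking in our code.
    cursor.execute("SELECT nextval(%s)", (sequence_name,))
    return cursor.fetchone()[0]
```

Note that the sequence name is passed as an ordinary query parameter here, which PostgreSQL coerces to a sequence reference; the dynamic-name wrinkle discussed below arises when the name must appear as a literal identifier in the SQL text.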
Gut intuition could tell you, “Anything that inherently processes requests sequentially is bound to have a limit.” While true in theory, if that limit doesn’t come into play in practice, then it is merely an alleged bottleneck that’s a rounding error compared to the logging or other processing your service normally does.
There is also a rightful aversion to technology that isn’t in the newer direction you want to go, in this case SQL vs. NoSQL. But when you weigh the coding cost of Option 3, and the risk of custom code going wrong or slipping the timeline, against the simpler, tried-and-true database sequence of Option 4, you would be silly not to make the latter your Plan A.
The DB sequence path still brought us interesting challenges, which we won’t fully dive into in this post, but I’ll leave you with one. Since DB sequences are modeled like tables, their names are identifiers rather than values, and variable identifiers aren’t something you can bind with PREPARE statements, which not only boost performance but also auto-sanitize your parameters.
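When an identifier has to be interpolated into the SQL text yourself, one defensive sketch is to whitelist the name against a strict pattern before it ever touches the query string. The pattern and function below are illustrative, not our production code:

```python
import re

# PostgreSQL-style unquoted identifier: lowercase letters, digits,
# underscores, up to 63 characters, not starting with a digit.
VALID_SEQUENCE_NAME = re.compile(r"^[a-z_][a-z0-9_]{0,62}$")

def nextval_query(sequence_name):
    # Reject anything that could smuggle SQL into the statement text,
    # since the prepared-statement machinery can't sanitize it for us.
    if not VALID_SEQUENCE_NAME.match(sequence_name):
        raise ValueError(f"invalid sequence name: {sequence_name!r}")
    return f"SELECT nextval('{sequence_name}')"
```

Driver-level helpers for quoting identifiers (e.g. psycopg2’s `sql.Identifier`) are another way to close the same gap.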