Last Friday morning, 6/10/2016, at 6:38 am Pacific time, we experienced an outage lasting 42 minutes. During the outage, users were unable to access our API and instead received HTTP 500 errors. The failed requests included, but were not limited to, label purchases and rating requests.
We sincerely regret the inconvenience this caused our customers. We hold ourselves to a high standard for system stability and provide at least a 99.9% SLA to all of our customers. To make sure we maintain our commitments, we have strict procedures for system upgrades. Unfortunately, despite our prior testing and migration planning, a problem still occurred that brought down access to our API. We let you down when we did not deliver. Please know that we have worked, and will continue to work, on making our systems better and more resilient so that events like this become less and less likely in the future.
As part of our technology infrastructure we use Redis as our message broker system. As the load on our systems grows from the increasing number of labels our customers are printing, we made the decision to migrate from Redis to RabbitMQ, which offers a number of advantages to help us improve our stability and handle the increasing volume of requests.
The Redis to RabbitMQ migration was scheduled for 9 pm Pacific time on June 8th, to minimize the impact if anything went wrong. We performed the migration, RabbitMQ began to handle the production load, and after closely monitoring all systems for 3 hours, we declared the migration a success.
The following events occurred on Friday morning:
6:36 am: RabbitMQ server 1 runs out of file descriptors
6:38 am: RabbitMQ server 1 crashes
6:39 am: RabbitMQ server 2 runs out of file descriptors
6:40 am: RabbitMQ server 2 crashes
6:44 am: Alerts go out to the team, investigation is started
7:05 am: RabbitMQ is identified as root cause, rollback to Redis is started
7:20 am: Rollback complete, errors stop
After completing the investigation into the crash, we determined the root cause to be RabbitMQ’s host configuration, which defaults to 1024 file descriptors, leaving ~900 available for connections. This default applies when there is no explicit configuration in /etc/default/rabbitmq-server to increase the maximum.
File descriptors are used for network sockets and for accessing files on disk. For example, queued messages are stored on disk, so each concurrent message needs a file descriptor.
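To make the limit concrete, here is a minimal Python sketch that reads the open-file limits of the current process and estimates how many descriptors remain for connections. It is an illustration only, not the tooling we used: the helper names are hypothetical, the `resource` module is available on Unix-like systems only, and the reserved figure of 124 is an assumption chosen to match the "~900 available" number above.

```python
import resource


def describe_fd_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def usable_descriptors(soft_limit, reserved):
    """Descriptors left for connections once `reserved` are set aside.

    `reserved` is a hypothetical figure covering descriptors the process
    needs for its own files, logs, and so on.
    """
    return max(soft_limit - reserved, 0)


if __name__ == "__main__":
    soft, hard = describe_fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    # With the 1024 default and an assumed 124 reserved descriptors:
    print(usable_descriptors(1024, 124))
```

Running the same check as the user that owns the broker process shows whether a raised limit has actually taken effect.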
When the new RabbitMQ servers went online, the volume wasn’t enough to run into this maximum. While maxed-out file descriptors would not generally cause RabbitMQ to shut down completely, running RabbitMQ in a cluster configuration exacerbated the situation. As one machine shut down, all of the traffic was directed to the remaining machine, overloading its ports, which in turn caused the entire RabbitMQ cluster to fail, leading to the downtime.
To prevent this specific issue from happening again, we’ve raised the file descriptor limit from the default of 1024 to 102,400, which is 20x our required capacity. We’re also making the following adjustments to our maintenance and deployment procedures to prevent similar issues from taking place in the future:
Additional System Monitoring
We didn’t find out about the crash until the server shut down. We should have known about a potential issue when RabbitMQ was getting close to its limits. We will be using the RabbitMQ API to fetch statistics and other information (such as file descriptors, maximum memory, etc.) and pushing it into our existing alerting platform.
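As a rough sketch of what such a check could look like, the Python below polls the `/api/nodes` endpoint of the RabbitMQ management plugin and flags any node whose usage crosses a threshold. This is not our production code. The function names, the 80% threshold, and the URL and credentials in the usage note are hypothetical, and the field names (`fd_used`, `fd_total`, `sockets_used`, `sockets_total`, `mem_used`, `mem_limit`) are the ones we understand the management API to report, so verify them against your RabbitMQ version.

```python
import base64
import json
import urllib.request

# (used, total) field pairs as reported per node; treat these as assumptions.
RESOURCE_FIELDS = (
    ("fd_used", "fd_total"),
    ("sockets_used", "sockets_total"),
    ("mem_used", "mem_limit"),
)


def fetch_nodes(base_url, user, password):
    """GET /api/nodes from the management plugin and return the parsed JSON."""
    request = urllib.request.Request(f"{base_url}/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def nodes_near_limit(nodes, threshold=0.8):
    """Return (node name, field, utilization) for each resource over threshold."""
    alerts = []
    for node in nodes:
        for used_key, total_key in RESOURCE_FIELDS:
            used = node.get(used_key)
            total = node.get(total_key)
            # Skip fields that are missing or zero on this version or node.
            if used is None or not total:
                continue
            utilization = used / total
            if utilization >= threshold:
                alerts.append((node.get("name"), used_key, utilization))
    return alerts
```

A scheduled job could call `nodes_near_limit(fetch_nodes("http://localhost:15672", "monitor", "secret"))` and forward any results to the alerting platform, so that a warning fires well before a node runs out of descriptors.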
Distributed Load Testing Environment
The bug is only exposed when the RabbitMQ cluster is handling requests from multiple servers: each server requires its own connection pool, which requires more file descriptors. The testing environment we used accounted for concurrent users, but it did not account for distributed consumers of the RabbitMQ cluster. Going forward, we’ve set up an environment to run distributed load testing for all future infrastructure changes.
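The arithmetic behind this gap can be sketched in a few lines of Python. The numbers here are hypothetical and are not our production figures. The sketch assumes one descriptor per pooled connection and an even spread of connections across the live RabbitMQ nodes, which is a simplification.

```python
import math


def total_connections(app_servers, pool_size):
    """Connections opened when every app server keeps its own pool."""
    return app_servers * pool_size


def connections_per_node(app_servers, pool_size, live_nodes):
    """Connections each live broker node carries, assuming an even spread."""
    if live_nodes < 1:
        raise ValueError("at least one live node is required")
    return math.ceil(total_connections(app_servers, pool_size) / live_nodes)


def exceeds_limit(app_servers, pool_size, live_nodes, usable_descriptors):
    """True when the per-node connection count is over the descriptor budget."""
    return connections_per_node(app_servers, pool_size, live_nodes) > usable_descriptors


if __name__ == "__main__":
    # A single load generator with a pool of 50 barely touches a 2-node cluster.
    print(connections_per_node(1, 50, live_nodes=2))
    # Forty app servers with the same pool size tell a different story.
    print(connections_per_node(40, 50, live_nodes=2))
    # If one node fails, the survivor inherits every connection.
    print(connections_per_node(40, 50, live_nodes=1))
```

In this illustration a single-source test stays far below a budget of roughly 900 descriptors, while the distributed case exceeds it, and the load on the surviving node doubles once the other node goes down.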
We let our customers down, but we will learn from the incident and are striving to eliminate problems like this in the future.
Shippo is a multi-carrier API and web app that helps retailers, marketplaces and platforms connect to a global network of carriers. Businesses use Shippo to get real-time rates, print labels, automate international paperwork, track packages and facilitate returns. Shippo provides the tools to help businesses succeed through shipping.