
Post-Mortem 6/10: Redis to RabbitMQ Migration

Wed 15 Jun 2016
By Shippo Engineering

This past Friday morning, 6/10/2016 at 6:38am Pacific Time, we experienced an outage lasting 42 minutes. During the outage, users were unable to access our API and instead received HTTP 500 errors. The failed requests included, but were not limited to, label purchases and rating requests.

We sincerely regret the inconvenience this caused our customers. We hold ourselves to a high standard for system stability and provide at least a 99.9% SLA to all of our customers. To make sure we maintain our commitments, we have strict procedures for system upgrades, but unfortunately, despite our prior testing and migration planning, a problem still occurred that brought down access to our API. We let you down when we did not deliver. Please know that we have worked, and will continue to work, on making our systems better and more resilient so that events like this become less and less likely in the future.
   
The Background
As part of our technology infrastructure, we use Redis as our message broker. As the load on our systems has grown with the increasing number of labels our customers are printing, we made the decision to migrate from Redis to RabbitMQ, which offers a number of advantages that will help us improve stability and handle the increasing volume of requests.

The Redis to RabbitMQ migration was scheduled for 9pm Pacific Time on June 8th to minimize the impact if anything went wrong. We performed the migration, RabbitMQ began handling the production load, and after closely monitoring all systems for 3 hours, we declared the migration a success.

The Crash
The following events occurred on Friday morning:

6:36 am: RabbitMQ server 1 runs out of file descriptors

=WARNING REPORT==== 10-Jun-2016::13:36:26 ===
file descriptor limit alarm set.
********************************************************************
*** New connections will not be accepted until this alarm clears ***
********************************************************************


6:38 am: RabbitMQ server 1 crashes

=ERROR REPORT==== 10-Jun-2016::13:38:43 ===
** Generic server rabbit_disk_monitor terminating
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@ip-xx-xx-x-xxx",
                               50000000,49028952064,100,10000,
                               #Ref<0.0.184.142894>,false,true}
** Reason for termination ==
** {unparseable,[]}


6:39 am: RabbitMQ server 2 runs out of file descriptors

=WARNING REPORT==== 10-Jun-2016::13:39:27 ===
file descriptor limit alarm set.
********************************************************************
*** New connections will not be accepted until this alarm clears ***
********************************************************************


6:40 am: RabbitMQ server 2 crashes

=ERROR REPORT==== 10-Jun-2016::13:40:57 ===
** Generic server rabbit_disk_monitor terminating
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@ip-xx-xx-x-xxx",
                               50000000,49048866816,100,10000,
                               #Ref<0.0.172.96593>,false,true}
** Reason for termination ==
** {unparseable,[]}


6:44 am: Alerts go out to the team, investigation is started

7:05 am: RabbitMQ is identified as root cause, rollback to Redis is started

7:20 am: Rollback complete, errors stop

The Investigation
After completing the investigation into the crash, we determined the root cause to be RabbitMQ's host configuration, which defaults to 1024 file descriptors, leaving ~900 available for connections. This default applies when there is no explicit configuration in /etc/default/rabbitmq-server to increase the maximum.

File descriptors are used for network sockets and for accessing files on disk (for example, queued messages that are stored on disk, so each message being read or written concurrently needs a file descriptor).

When the new RabbitMQ servers first went online, the volume wasn't enough to run into this maximum. While maxed-out file descriptors generally would not cause RabbitMQ to shut down completely, running RabbitMQ in a cluster configuration exacerbated the situation: as one machine shut down, all of the traffic was directed to the remaining machine, overloading its connections, which in turn caused the entire RabbitMQ cluster to fail, leading to the downtime.
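For reference, one quick way to confirm the effective limit on a running broker process is to read it from /proc on the host. The following is a minimal Python sketch, assuming a Linux host and that the broker's PID is known; it is illustrative only, not part of our production tooling.

# Sketch: read the effective "Max open files" limit for a running process
# from /proc/<pid>/limits (Linux only). Illustrative, not production tooling.
import sys

def max_open_files(pid):
    with open("/proc/%d/limits" % pid) as f:
        for line in f:
            if line.startswith("Max open files"):
                # Line format: "Max open files  <soft>  <hard>  files"
                parts = line.split()
                return int(parts[3]), int(parts[4])
    raise RuntimeError("'Max open files' entry not found")

if __name__ == "__main__":
    pid = int(sys.argv[1])  # PID of the RabbitMQ (beam.smp) process
    soft, hard = max_open_files(pid)
    print("soft=%d hard=%d" % (soft, hard))
    if soft <= 1024:
        print("WARNING: still at the 1024 default")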

Going Forward
To prevent this specific issue from happening again, we've raised the file descriptor limit from the default of 1024 to 102,400, which is 20x our required capacity. We're also making the following adjustments to our maintenance and deployment procedures to prevent similar issues from taking place in the future:

Additional System Monitoring
We didn't find out about the crash until the servers shut down. We should have known about a potential issue when RabbitMQ was getting close to its limits. We will be using the RabbitMQ management API to fetch statistics and other information (such as file descriptors, memory limits, etc.) and pushing them into our existing alerting platform.
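A minimal sketch of what that polling could look like, using the RabbitMQ management HTTP API's /api/nodes endpoint. It assumes the management plugin is enabled on its default port 15672; the host, credentials, 80% threshold, and alert() hook below are placeholders for illustration, not our actual alerting integration.

# Sketch: poll the RabbitMQ management API for per-node resource usage and
# raise an alert when file descriptors get close to the limit.
# Host, credentials, threshold, and alert() are illustrative placeholders.
import requests

MGMT_URL = "http://rabbitmq.internal:15672/api/nodes"   # hypothetical host
AUTH = ("monitor", "secret")                            # placeholder credentials
FD_ALERT_RATIO = 0.8                                    # alert at 80% of the limit

def alert(message):
    # Placeholder: push into the existing alerting platform
    print("ALERT: %s" % message)

def check_nodes():
    nodes = requests.get(MGMT_URL, auth=AUTH, timeout=10).json()
    for node in nodes:
        fd_used, fd_total = node["fd_used"], node["fd_total"]
        if fd_used >= FD_ALERT_RATIO * fd_total:
            alert("%s using %d of %d file descriptors"
                  % (node["name"], fd_used, fd_total))

if __name__ == "__main__":
    check_nodes()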

Distributed Load Testing Environment
The bug is only exposed when the RabbitMQ cluster is handling requests from multiple servers: each server requires its own connection pool, which in turn requires more file descriptors. The testing environment we used accounted for concurrent users, but did not account for distributed consumers of the RabbitMQ cluster. Going forward, we've set up an environment to run distributed load testing for all future infrastructure changes.
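As a rough illustration of the difference, here is a sketch of a load generator in which every worker process opens its own broker connection, the way separate application servers would, instead of sharing a single pooled connection. The host, queue name, and message counts are made up for the example, and pika is used here as one common Python client.

# Sketch: crude distributed-style load generation against a RabbitMQ cluster.
# Each worker process opens its own connection (and therefore its own socket /
# file descriptor on the broker), mimicking many independent app servers.
# Host, queue name, and counts are illustrative only.
import multiprocessing
import pika

RABBIT_HOST = "rabbitmq-test.internal"   # hypothetical test cluster
QUEUE = "loadtest"
PRODUCERS = 50                           # number of separate connections to open
MESSAGES_PER_PRODUCER = 1000

def produce(worker_id):
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=RABBIT_HOST))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    for i in range(MESSAGES_PER_PRODUCER):
        channel.basic_publish(exchange="", routing_key=QUEUE,
                              body="worker %d message %d" % (worker_id, i))
    conn.close()

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=produce, args=(w,))
               for w in range(PRODUCERS)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()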

We let our customers down, but we are learning from this incident and striving to eliminate problems like this in the future.
