Evolution of Our PagerDuty Playbook: Fewer alerts, more uptime
Since thousands of online businesses rely on Shippo to fetch shipping rates and print hundreds of shipping labels every second, we provide a 99.9% uptime guarantee.
One of the major operations we recently overhauled is alerting for the engineering team. Healthy alerting helps our team maintain a good balance between moving fast and breaking nothing. It gives us the ability to release quickly, triage edge cases that escaped testing before they impact customers, and continuously improve our product.
As with any process, as the team grew, our alerting process needed adjusting.
Goals of improving the alerting process
Letting our alerts go unmaintained cost the engineering team a significant amount of time and energy, at all hours of the day and night.
Poor alerting outcomes:
- Inability to uphold 99.9% uptime SLA
- Downtime for customers, loss of customer trust
- Exhausted and annoyed on-call team
Impact of improving alerting:
- Internal confidence in upholding 99.9% uptime
- Fewer day-to-day distractions for the engineering team
- Customers who can trust that they are in good hands
Assessing the Situation
We use PagerDuty for our alerting. Between March 2015 and March 2016, we let the number of alerts get out of control.
What’s more concerning is the correlation between the number of alerts and time to acknowledgement (TTA). More alerts resulted in higher TTA, which is especially evident in December, and highly correlated in January and February.
Even worse, as we allowed the number of alerts to grow, the alert messages themselves became more convoluted and unclear.
[goshippo.com] A1:1 auto-prod.EnvironmentName.us-east… “Auto-prod async worker” EnvironmentHealth (20.0) Account: goshippo.com ALERT Severity 1 Check: Auto-prod async worker Host: auto-prod.EnvironmentName.us-east… Metric: EnvironmentHealth (20.0) Agent: [...] Occurred: [date/time] url
Luckily, the unclear messages did not impact the time to resolve (TTR), which remained relatively stable even as the number of alerts increased. However, this raised the question of whether many of those alerts were necessary in the first place.
Once it became obvious that our alerting process was ineffective, we decided to evaluate the procedure from the ground up. We set out to accomplish three goals:
- Reduce alert noise
- Reduce team fatigue
- Reduce time to acknowledgement (TTA)
What we did
Correlate and categorize alerts with incidents
Before we took any action, we analyzed historical data. Since we have detailed, timestamped logs of all Shippo incidents since the “beginning of time”, we were able to link alerts to customer-facing incidents.
This gave us a good understanding of which alerts were meaningful, and which were not. This also helped us document ongoing impacts of alerts and incidents.
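As a sketch of how this kind of linking can work, consider matching each alert's timestamp against incident windows. The timestamps and the ten-minute matching window below are hypothetical, purely for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical records; the real data came from our timestamped incident log.
incidents = [
    (datetime(2016, 1, 4, 2, 10), datetime(2016, 1, 4, 3, 0)),    # (start, resolved)
    (datetime(2016, 2, 11, 14, 5), datetime(2016, 2, 11, 14, 40)),
]
alerts = [
    datetime(2016, 1, 4, 2, 12),   # fired during the first incident
    datetime(2016, 1, 20, 9, 30),  # no matching incident: likely noise
    datetime(2016, 2, 11, 14, 7),  # fired during the second incident
]

def matches_incident(alert_time, window=timedelta(minutes=10)):
    """An alert is 'meaningful' if it fired near a customer-facing incident."""
    return any(start - window <= alert_time <= end + window
               for start, end in incidents)

meaningful = [a for a in alerts if matches_incident(a)]
print(f"{len(meaningful)} of {len(alerts)} alerts matched an incident")
# -> 2 of 3 alerts matched an incident
```

Alerts that repeatedly fail to match any incident are the prime candidates for removal in the next step.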
Cut out noisy alerts
After getting familiar with the data, we reviewed historical alerts to examine which of them flagged actual issues, and which were just noise.
Just as we suspected, most alerts were non-issues that did not need our attention or require any action.
It turned out that at some point we had accumulated six different monitoring services, all reporting to PagerDuty. While removing noisy alerts, we also removed the services that were producing mostly noise. Of course, for any service we removed, we made sure to port over its accurate alerts.
We were able to cut many alerts out of the system, dramatically decreasing the amount of noise.
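A rough sketch of the kind of per-service noise tally that can drive these decisions. The service names and log entries below are made up for illustration, not our actual inventory:

```python
from collections import Counter

# Hypothetical alert log: (monitoring service, was the alert actionable?).
alert_log = [
    ("cloudwatch", True), ("cloudwatch", False),
    ("pingdom", True),
    ("legacy-cron-check", False), ("legacy-cron-check", False),
    ("legacy-cron-check", False), ("legacy-cron-check", True),
]

totals, actionable = Counter(), Counter()
for service, was_actionable in alert_log:
    totals[service] += 1
    actionable[service] += was_actionable  # bools count as 0/1

# A service whose alerts are mostly noise is a candidate for removal.
for service in totals:
    noise = 1 - actionable[service] / totals[service]
    print(f"{service}: {noise:.0%} noise over {totals[service]} alerts")
```

In this toy data, `legacy-cron-check` is 75% noise over four alerts, exactly the profile of a service worth retiring after porting over its one accurate alert.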
Monitor system daily for issues and regressions
After we cut out the noise, we needed to make sure that the remaining alerts were accurate and trustworthy. We also wanted to improve our alert messages so that the on-call engineer could diagnose the situation accurately and act as quickly as possible.
Auto-prod: EnvironmentHealth = 20.0
(Values map to the following health levels: 25 = Severe, 20 = Degraded, 15 = Warning, 10 = No Data, 5 = Unknown, 1 = Info, 0 = OK)
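A tiny sketch of how an alert formatter might translate the raw number into the label above, so the on-call engineer reads "Degraded" rather than "20.0". The `describe` helper is hypothetical, not our actual code:

```python
# The EnvironmentHealth values and labels, as listed above.
HEALTH_LEVELS = {
    25: "Severe", 20: "Degraded", 15: "Warning",
    10: "No Data", 5: "Unknown", 1: "Info", 0: "OK",
}

def describe(value):
    """Translate a raw EnvironmentHealth number into a readable label."""
    # Float values like 20.0 hash equal to int 20, so the lookup still works.
    return HEALTH_LEVELS.get(value, f"Unrecognized ({value})")

print(describe(20.0))  # -> Degraded
```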
Monitor alerts weekly and carry out improvements
We started a weekly meeting to review alerts and address any impending issues. We reviewed our PagerDuty alerts for accuracy and urgency. This helped us prioritize scalability projects and, more importantly, empowered all engineers on the team to submit feedback and/or take action to improve the alerts themselves.
Defining the on-call playbook
As we grew in size, notifying the correct team members in a timely manner became very important to decrease our time to resolution (TTR).
After some experimentation, we settled on a playbook:
1. There are two engineers on the on-call rotation, a primary and a secondary.
- Primary duties:
- Alert the secondary and inform the team of the alert in our #incidents Slack channel
- Investigate, formulate a solution, implement and deploy it, and wait for recovery
- Declare “All Clear”
- Secondary duties:
- Help investigate, formulate, and implement solution
- Help monitor the situation while the primary firefights
- Keep #incidents posted on progress, and update our status page
2. For internal communication, we created a #incidents Slack channel. At any moment, its state is one of the following:
- ALL CLEAR
- EVALUATING ALERT
- ACTIVE INCIDENT
3. “Waking someone”: escalating to another engineer to ask for help. The key action here is to establish the urgency of the situation so that this person understands it is not a normal day-to-day communication.
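The three channel states imply a small set of legal transitions, for instance that "ALL CLEAR" should never follow directly from nothing being evaluated. The transition rules below are an assumption for this sketch, not tooling described above:

```python
# Hypothetical sketch: legal transitions between #incidents channel states.
VALID_TRANSITIONS = {
    "ALL CLEAR": {"EVALUATING ALERT"},
    "EVALUATING ALERT": {"ACTIVE INCIDENT", "ALL CLEAR"},  # alert may be noise
    "ACTIVE INCIDENT": {"ALL CLEAR"},
}

def next_state(current, new):
    """Reject transitions that would skip or scramble the incident flow."""
    if new not in VALID_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current} -> {new}")
    return new

state = "ALL CLEAR"
state = next_state(state, "EVALUATING ALERT")
state = next_state(state, "ACTIVE INCIDENT")
state = next_state(state, "ALL CLEAR")
print(state)  # -> ALL CLEAR
```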
With the on-call playbook, we also defined a rotation, to make sure that on-call engineers allocate time out of their sprints and select appropriately non-critical tasks during their on-call weeks, so that they are able to respond to alerts.
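As an illustration of the rotation mechanics, here is a weekly schedule generator. The engineer names are placeholders, and the rule that the secondary steps up to primary the following week is an assumption for this sketch:

```python
from datetime import date, timedelta

# Placeholder names; in practice this list comes from the team roster.
engineers = ["alice", "bob", "carol", "dan"]

def rotation(start, weeks):
    """Yield (week_start, primary, secondary) for each on-call week.

    The secondary becomes the next week's primary, so everyone shadows
    an incident week before leading one.
    """
    for i in range(weeks):
        primary = engineers[i % len(engineers)]
        secondary = engineers[(i + 1) % len(engineers)]
        yield start + timedelta(weeks=i), primary, secondary

for week_start, primary, secondary in rotation(date(2016, 9, 5), 4):
    print(week_start, primary, secondary)
```

With four engineers, each person is primary one week in four, which is the kind of predictability that lets on-call engineers plan non-critical sprint work in advance.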
Results of alerting improvements
Since we implemented these new processes, our number of alerts, TTA, and TTR have significantly improved.
- Number of alerts decreased by 80%
- TTA reduced from 25 minutes to 4 minutes
- The on-call engineer is paged less frequently, which means less fatigue and better response times
- Alerts are better documented and prepared for quick action
- Better documentation, processes, and incident reports for entire team
If writing code that sends millions of packages across the world and building solutions for the long term interests you, we’re hiring engineers to join our on-call rotation ;)