The Road to Aurora PostgreSQL with Near-zero Downtime
After the 2018 holiday season, we realized that Shippo’s rapid business growth would soon push our transactional Amazon RDS for PostgreSQL database to the supported “ceiling” of our current RDS version in terms of storage, IOPS, and availability, and that this would happen before the holiday peak in 2019. After an extensive evaluation, we decided to migrate our database from AWS RDS for PostgreSQL to the latest AWS Aurora PostgreSQL-compatible database.
AWS RDS for PostgreSQL is Amazon’s managed cloud offering of PostgreSQL, one of the most preferred and stable open-source relational database systems. While keeping all the features PostgreSQL offers, it frees users from most database administration tasks. However, it also inherits a limitation of the PostgreSQL database engine: a major version upgrade requires downtime. For example, upgrading from v9.6 to v10.x requires downtime of the application system, even with the fastest upgrade option, the in-place upgrade.
Depending on the size of the database and the number of major versions between the current version and the targeted version, the downtime can be hours. Additionally, we needed to migrate RDS for PostgreSQL to Aurora PostgreSQL either through creating an Aurora read replica or restoring the latest snapshot backup, which would potentially incur more downtime.
The Road to AWS Aurora
In our case, the normal migration path would require about eight hours of downtime, which would result in significant lost business revenue and thousands of unhappy customers. This was not an acceptable option for us.
We reached out to AWS, and worked closely and extensively with AWS’s team of solution architects, database specialists, and migration experts to reduce the downtime window. Despite our efforts, however, significant downtime was unavoidable with an in-place upgrade.
Going back to the drawing board and looking for different approaches, we discovered the AWS Database Migration Service (DMS). DMS is targeted at customers migrating from on-premises environments to the cloud, but we realized that it could serve our use case as well and deliver a near-zero downtime migration.
AWS Database Migration Service (DMS) uses PostgreSQL logical replication for near-real-time synchronization of data between the source and target databases. The source database can remain fully operational during the migration until the target database is promoted to be the primary database, minimizing downtime to applications that rely on the database. After conducting a successful Proof of Concept (POC) test with the help of the AWS team, we decided that a one-step migration to Aurora was not only feasible, but also safe.
At a high level, our approach included the following steps:
- Enable the source database to support DMS.
- Set up the target Aurora PostgreSQL environment.
- Perform database schema migration to Aurora using the native PostgreSQL tool.
- Set up an AWS DMS environment, including a replication server, and fine-tune the data migration tasks [full load and change data capture (CDC)].
- Start the live data migration tasks, monitoring, and then cut over to the target environment.
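As a rough illustration of the first and third steps, the sketch below builds (but does not send or run) a parameter-group change that turns on logical replication for an RDS for PostgreSQL source, and the argument list for a schema-only dump. The exact settings and tool are not spelled out above, so treat `pg_dump`, the parameter group name, the host, and the file name as assumptions for illustration only.

```python
from typing import Dict, List


def logical_replication_params(parameter_group: str) -> Dict[str, object]:
    """Keyword arguments shaped for boto3's rds.modify_db_parameter_group.

    Nothing is sent to AWS here; the caller decides whether to apply it.
    """
    return {
        "DBParameterGroupName": parameter_group,  # hypothetical group name
        "Parameters": [
            {
                # Static parameter: it takes effect only after a reboot.
                "ParameterName": "rds.logical_replication",
                "ParameterValue": "1",
                "ApplyMethod": "pending-reboot",
            }
        ],
    }


def schema_dump_command(host: str, dbname: str, user: str, outfile: str) -> List[str]:
    """Argument list for a schema-only pg_dump (assumed tool, not confirmed)."""
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--dbname", dbname,
        "--schema-only",  # DDL only, no table data
        "--no-owner",     # skip ownership statements
        "--file", outfile,
    ]


if __name__ == "__main__":
    print(logical_replication_params("source-pg-params"))
    print(" ".join(schema_dump_command("source.example.internal", "appdb", "admin", "schema.sql")))
```

The dumped schema file could then be applied to the Aurora target with a client such as `psql` before any DMS task starts loading data.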
The following data flow diagram illustrates the high-level architecture:
The database system is the heart of our transactional application. The key to a successful database migration project is the careful and thorough planning of the project, as well as extensive testing. We created a detailed work plan with milestones, and go/no-go decisions to track our work. Along the road to Aurora, we went through the following steps to get ready:
- POC test on DMS migration and Aurora
- Production-like performance and cutover tests on both DMS and Aurora
- Application, functional, regression, and full-compatibility tests on Aurora
- Go-live in production
- Monitoring production after go-live
The Road To Aurora Was Not Without Some “Bumps”
Like any new software product, DMS is not bug-free. As we dove deeper and executed our implementation and testing plan, we uncovered a number of potential showstoppers.
Here are some of the issues we encountered:
- The full load of some large tables, such as tables with billions of rows and several hundred GB of storage, never completed. This was caused by a bug present in both an older and the latest version (v2.4.5 and v3.1.3) of the DMS replication server. The AWS engineering team quickly released a patch to resolve the issue. Since the full load replication for a large table may take many hours to complete, we had to start the replication tasks a few days before the migration go-live to allow enough time for them to catch up.
- Change data capture (CDC) replication lagged too far behind to catch up because of the large volume of active changes. This happened to our large partitioned tables. AWS engineering released some patches to beef up the DMS replication server, but that didn’t eliminate the lag. After many trial-and-error tests and optimizations on the replication tasks, we eventually created multiple parallel DMS replication tasks at the partition level. This reduced the volume of activity on each task and sped up the overall CDC replication process.
- Partitioned tables were not well supported. Partitioned tables implemented by inheritance aren’t supported by DMS yet. We created some workarounds per the DMS implementation instructions. However, we still ran into data integrity issues, including duplicated rows. We ended up creating multiple rules (triggers are not supported) on the partitions to ensure data uniqueness and integrity in the target while minimizing the overhead that might affect CDC replication task performance.
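Two of the workarounds in the list above can be sketched in a few lines. The first function builds one DMS task definition per partition, shaped as keyword arguments for boto3's `create_replication_task`, with a table mapping that selects a single partition. The second generates a PostgreSQL rule that silently skips an insert when the key already exists. The actual task layout and rule definitions are not given above, so the partition names, ARNs, key column, and rule body are all hypothetical.

```python
import json
from typing import Dict, List


def partition_task_specs(schema: str, partitions: List[str], source_arn: str,
                         target_arn: str, instance_arn: str) -> List[Dict[str, str]]:
    """One full-load-and-CDC task definition per partition (nothing is sent to AWS)."""
    specs = []
    for name in partitions:
        mapping = {
            "rules": [{
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "1",
                "object-locator": {"schema-name": schema, "table-name": name},
                "rule-action": "include",
            }]
        }
        specs.append({
            # Task identifiers allow letters, digits, and hyphens only.
            "ReplicationTaskIdentifier": "task-" + name.replace("_", "-"),
            "SourceEndpointArn": source_arn,
            "TargetEndpointArn": target_arn,
            "ReplicationInstanceArn": instance_arn,
            "MigrationType": "full-load-and-cdc",
            "TableMappings": json.dumps(mapping),
        })
    return specs


def dedup_rule_sql(schema: str, partition: str, key: str) -> str:
    """SQL for a rule that ignores inserts whose key is already present.

    Identifiers are interpolated as-is, so pass trusted names only.
    """
    table = f"{schema}.{partition}"
    return (
        f"CREATE OR REPLACE RULE {partition}_skip_dup AS\n"
        f"    ON INSERT TO {table}\n"
        f"    WHERE EXISTS (SELECT 1 FROM {table} WHERE {key} = NEW.{key})\n"
        f"    DO INSTEAD NOTHING;"
    )


if __name__ == "__main__":
    parts = ["events_2019_01", "events_2019_02"]  # hypothetical partitions
    for spec in partition_task_specs("public", parts, "src-arn", "tgt-arn", "inst-arn"):
        print(spec["ReplicationTaskIdentifier"])
    print(dedup_rule_sql("public", parts[0], "id"))
```

Splitting by partition keeps each task's change stream small, which is the effect described above; the trade-off is more tasks to monitor and restart individually.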
In addition, we ran into a number of data issues and compatibility issues related to sequences, triggers, and foreign key constraints, as well as some DMS bugs. Working closely with AWS’s DMS engineering team, we managed to develop a number of patches, optimize our processes, and modify our batch job workflow to minimize all risks. Of course, we also developed a Plan B as “the best-laid plans of mice and men oft go astray.”
After three months of hard work and collaboration with AWS, we were ready to go live.
Besides minimizing the downtime at the database level during the cutover from the old database to the new Aurora database, we used a blue/green deployment. The blue/green deployment meant no application configuration change was required, and also provided a very fast rollback in case of problems. This approach resulted in a cutover to the new database with less than five minutes of downtime.
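A cutover like this usually hinges on the target being fully caught up. One way to make that go/no-go check mechanical is a small gate on replication lag, sketched below. It assumes the caller has already collected recent samples of the DMS CloudWatch metrics `CDCLatencySource` and `CDCLatencyTarget` (in seconds); the five-second threshold is a hypothetical value, not one taken from this migration.

```python
from typing import Iterable


def ready_for_cutover(source_latency_s: Iterable[float],
                      target_latency_s: Iterable[float],
                      max_lag_s: float = 5.0) -> bool:
    """Return True only if every recent lag sample is within the threshold."""
    src = list(source_latency_s)
    tgt = list(target_latency_s)
    if not src or not tgt:
        # No samples means no evidence that replication is caught up.
        return False
    return max(src) <= max_lag_s and max(tgt) <= max_lag_s


if __name__ == "__main__":
    print(ready_for_cutover([0.0, 1.0, 2.0], [1.0, 3.0]))
    print(ready_for_cutover([0.0, 1.0, 2.0], [1.0, 30.0]))
```

Using the worst sample in the window rather than the average is a deliberately conservative choice: a single spike is enough to postpone the switch.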
Below is our application deployment architecture:
Lessons Learned
- Ask for help: Don’t hesitate to ask AWS for help with your unique situation. We had a great partnership with the AWS Support and Engineering team to address all the issues from DMS and Aurora.
- Think outside the box: We ran into multiple blockers along the way to Aurora. Some were on the migration path and others were on DMS reliability and scalability. We put our heads together working against the “norm” and eventually came up with solutions.
- Test until the light shines home: We ran many tests, including functional tests, integration tests, performance/load tests, and dry runs. We tested until we completely understood what should happen.
- Detailed planning: We created a detailed, minute-by-minute implementation plan with checkpoints throughout the day of the migration, which helped ensure a smooth migration with no unexpected issues causing downtime. We also did two dry runs to validate the implementation plan.
- Flexibility of infrastructure configuration: Our flexible infrastructure configuration allowed us to set up a blue/green deployment to ensure a cutover from the old RDS database to the new Aurora database with zero customer disruption.
- Contingency plan: Even with thorough preparation, you cannot guarantee that nothing unexpected will happen. We had a contingency plan and took the time to test it.
- Management commitment: We ensured upper management had our back for the project and kept them informed.
It All Paid Off
- Boost to the overall system: We saw an overall system improvement with Aurora: unlimited IOPS, more storage, faster query execution, high availability, near-zero replication lag, and enhanced monitoring.
- Another win: Applying what we learned from DMS, we solved another scalability problem that affected the legacy database schema model. We were able to apply wide-ranging, global database schema changes and optimizations without disrupting customers, which was another big win for us.
- Showcase in AWS Aurora Blog: We ended up being featured in the AWS Aurora blog!
One of our goals in Engineering is to always look for new ways and ideas to significantly improve and advance our technology stacks. The experience and lessons from this journey will help provide a framework for achieving this goal reliably, effectively, and in a timely manner. Last but not least, we are very thankful to the Shippo Engineering team for upholding Shippo’s values, “Passion for hard challenges” and “We haven’t won yet,” and to the very supportive AWS team.