AWS Yarns

A blog about AWS things.

Migrating a physical DB to unlock a mass migration into AWS - Part 1

Posted by Chris McKinnel - 22 April 2022
8 minute read

I was introduced to one of our customers last year who had their finger on the trigger to buy new physical hardware for the database that ran their core business application, as the existing hardware was almost out of warranty.

The application is used by thousands of people across multiple time zones, and if it breaks it costs thousands of dollars a minute in lost revenue.

When I was introduced to the customer, they had already been in contact with AWS and had been given cost estimates for a like-for-like database in AWS, which looked much more expensive than what they were paying for the physical server over its life expectancy.

The MS SQL Server licensing costs were the main killer for RDS - the goodness you get by using RDS (fully managed instances, backups, point-in-time restore, etc.) was offset by the baked-in licensing costs. The customer had been grandfathered into a favourable licensing deal with Microsoft over the years, and not being able to BYOL to RDS made the numbers skew the wrong way.

Man yelling at Microsoft website - me every time a customer mentions Microsoft licensing.

The other elephant in the room was whether there was a DB instance or service on AWS that could match the performance of the physical hardware without being cost-prohibitive. Importantly though, the customer did indicate that they would be happy to pay a little bit more for the inherent benefits of using cloud-based resources - like high availability, backups, etc.

After our first conversation, it became clear that what they really wanted was a trusted local partner to help guide them through figuring out whether it would be possible to migrate this critical application into AWS at a price point that didn't absolutely crush them - including a licensing model that would work.

The database itself was so key to the business that a lot of the production workloads lived in in-country IaaS so they could get low-latency comms to the database. If the database couldn't move to AWS, then neither could these other production workloads.

Zooming out

We widened the scope of the conversation a little bit, as the customer had been deploying development and test environments into AWS for a couple of years and things had started to sprawl.

It was clear that AWS had served a purpose for this customer over the last couple of years and they were poised to take the next step to get an environment that had a good enough security posture, governance and foundational infrastructure to hold the workloads that acted as the engine room for the business.

We agreed on a two-pronged approach to the challenge: we (CCL) would engage AWS as an Advanced Partner and look at funding options to help accelerate the discovery around the physical database, and we would run a deep-dive Well Architected session on their existing AWS environment.

It's all good talking about migrating production workloads into AWS, but without a good foundation it can be difficult to get business buy-in when the rubber starts hitting the road and the hard questions start to get asked around security, resiliency and governance.

Funding for the database migration discovery

We put together a business case for the migration of the database and its associated production workloads, and presented it to AWS with a funding request to help the customer pay for our professional services to evaluate options for getting the database into AWS.

The funding request was successful, which indicated to the customer that AWS was serious about coming to the party and assisting with the migration, as well as showing that CCL had strong relationships with AWS and was a trusted AWS partner.

With the funding, we agreed to test out the database with PostgreSQL on Aurora and various configurations of MS SQL on RDS and EC2, and then do some performance testing to find the most suitable instance type and size.
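In practice, each of those test targets is just an instance you stand up, benchmark and tear down. As a rough idea of what that looks like, here's a minimal boto3 sketch for provisioning an RDS SQL Server test instance - all of the identifiers, sizes and storage numbers are placeholders for illustration, not the customer's actual configuration:

    # Minimal sketch (not the customer's actual config): spin up a throwaway
    # RDS SQL Server instance to benchmark against. Assumes AWS credentials
    # and region are already configured in the environment.
    import boto3

    rds = boto3.client("rds", region_name="ap-southeast-2")

    rds.create_db_instance(
        DBInstanceIdentifier="db-eval-sqlserver",   # placeholder name
        Engine="sqlserver-se",                      # SQL Server Standard Edition
        DBInstanceClass="db.r5.2xlarge",            # one of several sizes to trial
        LicenseModel="license-included",            # RDS bakes the MS licence in
        AllocatedStorage=500,                       # GiB, placeholder
        StorageType="gp2",
        MasterUsername="admin",
        MasterUserPassword="change-me-please",      # placeholder, use Secrets Manager
        MultiAZ=False,                              # single-AZ is fine for a benchmark
    )

The Aurora PostgreSQL and EC2 options are provisioned differently, but the idea is the same: stand up each candidate, run the same benchmark against it, tear it down.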

AWS funding coins.

AWS has a heap of funding programs available for almost every migration use-case, so if you're thinking of migrating to AWS then get in touch with your AWS partner as they can help you navigate the many programs to get on the right one for you.

Well Architected

While our engineers were going through the database work (see below for more detail), we also initiated a deep-dive Well Architected Review of the customer's existing environment.

Well Architected Reviews are based on the 5 pillars of the Well Architected Framework, and usually consist of a 2-4 hour set of interviews with platform stakeholders. Basically, you go through a set of questions provided by AWS in the Well Architected tool, and you get a PDF report that tells you the areas where you can and should improve your posture.
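If you'd rather poke at the results programmatically than read the PDF, the Well Architected tool also has an API. Here's a rough boto3 sketch, assuming workloads have already been defined and answered in the tool - treat it as illustrative rather than a drop-in report generator:

    # Rough sketch: summarise Well Architected review risk counts with boto3.
    # Assumes workloads already exist in the Well Architected tool and that
    # credentials/region are configured in the environment.
    import boto3

    wa = boto3.client("wellarchitected", region_name="ap-southeast-2")

    for workload in wa.list_workloads()["WorkloadSummaries"]:
        review = wa.get_lens_review(
            WorkloadId=workload["WorkloadId"],
            LensAlias="wellarchitected",   # the core Well Architected Framework lens
        )["LensReview"]

        # RiskCounts buckets answers into HIGH / MEDIUM / NONE / UNANSWERED,
        # which is roughly what gets summarised in the PDF report.
        print(workload["WorkloadName"], review.get("RiskCounts", {}))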

The Well Architected Framework and Reviews are focused on individual workloads, and should be done regularly to make sure that any changes to your workloads still adhere to best practices.

The only downside is that these are eyes-off reviews, so we don't get access to the customer's AWS accounts to really see what's going on in there.

We already had a decent idea of where this customer was at based on initial conversations, so an eyes-off review wasn't a great fit. Instead, we did an eyes-on review, which AWS call a Deep Dive Well Architected Review. We call it something else again, but the point is we got read-only access to the customer's accounts and had a decent poke around to see where things were at.

Generally the eyes-on review will shine a light on platform-level configuration that might not be picked up in eyes-off question-and-answer sessions at a workload level. Things like CloudTrail misconfiguration, MFA on root accounts, permissive IAM policies, insecure security groups, sub-optimal network architecture, etc. The type of stuff you want to get right if you're going to be putting production data anywhere near the environment.
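To give a flavour of what having a poke around means in practice, here's a minimal read-only sketch of the kind of checks involved, using boto3. A real review leans on much more thorough tooling (and human judgement), so treat this as illustrative only:

    # Illustrative read-only checks of the sort an eyes-on review covers.
    # Assumes credentials/region are configured; nothing here modifies the account.
    import boto3

    cloudtrail = boto3.client("cloudtrail")
    iam = boto3.client("iam")
    ec2 = boto3.client("ec2")

    # Is there at least one multi-region CloudTrail trail?
    trails = cloudtrail.describe_trails()["trailList"]
    if not any(t.get("IsMultiRegionTrail") for t in trails):
        print("No multi-region CloudTrail trail found")

    # Does the root account have MFA enabled?
    summary = iam.get_account_summary()["SummaryMap"]
    if summary.get("AccountMFAEnabled", 0) != 1:
        print("Root account MFA is not enabled")

    # Any security groups wide open to the internet on all protocols?
    for sg in ec2.describe_security_groups()["SecurityGroups"]:
        for rule in sg["IpPermissions"]:
            open_to_world = any(r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", []))
            if open_to_world and rule.get("IpProtocol") == "-1":
                print(f"Security group {sg['GroupId']} allows all traffic from anywhere")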

Nobody likes being on the front page news with "security" or "breach" in the headline!

Anyway, we did a deep-dive Well Architected Review for this customer and reported on a decent number of improvements that could be made. We also suggested that this customer would benefit from a new Landing Zone, deployed with Control Tower.

AWS Well Architected Pillars.

Landing Zone via Control Tower

We spent a decent amount of time getting a new Landing Zone deployed with some foundational services in there so it was ready to take on production workloads: syncing up with other vendors to get Direct Connect hooked up, getting DNS services deployed and tested, and extending the AD domain.

Working alongside technical resources from the customer, we defined everything we deployed using Infrastructure as Code with Terraform and Terraform Cloud. We helped instil some best practices so the benefits of a software deployment lifecycle for infrastructure were realised.

MS SQL Server to PostgreSQL and back again

Meanwhile, we took the main business application on a journey of database discovery. We:

  • Performed a major version update of the application
  • Attempted a database backend change to PostgreSQL
  • Migrated the database to RDS and tested
  • Attempted PostgreSQL on EC2

The main challenge we ran into while attempting to migrate to PostgreSQL was that users of the application can write scripts in VB that interrogate the database directly using hard-coded T-SQL. These scripts could be run directly from the application interface, and it turned out that thousands of them had been written over the lifecycle of the software. If any were to be changed (from T-SQL to PostgreSQL-compatible SQL), there would be a large testing effort required to ensure they behaved the same way.
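Even sizing that problem is non-trivial. A hypothetical triage script along these lines gives the idea of how you might flag which scripts contain T-SQL-specific syntax - the patterns, file extension and directory layout here are assumptions for illustration, not tooling we actually used:

    # Hypothetical triage sketch: scan exported user scripts for T-SQL-only
    # constructs that would need rewriting for PostgreSQL. Patterns and file
    # layout are illustrative assumptions.
    import re
    from pathlib import Path

    TSQL_ONLY = {
        r"\bGETDATE\s*\(": "use NOW() / CURRENT_TIMESTAMP in PostgreSQL",
        r"\bISNULL\s*\(": "use COALESCE()",
        r"\bSELECT\s+TOP\s+\d+": "use LIMIT",
        r"\bDATEADD\s*\(": "use interval arithmetic",
        r"\[\w+\]": "square-bracket identifiers need double quotes",
    }

    for script in Path("exported_scripts").glob("**/*.sql"):   # assumed export location
        text = script.read_text(errors="ignore")
        hits = [note for pattern, note in TSQL_ONLY.items()
                if re.search(pattern, text, re.IGNORECASE)]
        if hits:
            print(f"{script}: {len(hits)} T-SQL-specific constructs found")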

These scripts were the core of the application, and were heavily used by all users of the application.

Where we ended up was keeping the application on MS SQL Server and running it on an EC2 instance - mainly due to the hard-coded T-SQL and the associated testing effort, but also due to licensing and some features the database was using that RDS didn't support.

Running the database on EC2 instances with like-for-like instance sizing was estimated to be more expensive than purchasing more hardware and continuing the status quo. So we needed to zoom out and show the bigger picture to get the customer to agree to using EC2 for the big database.

Migrate all the things

This customer had deployed resources in multiple disparate AWS accounts, in-country IaaS and even Azure.

We took an export of the IaaS resources we knew about, made some assumptions about the Azure resources, and fed them into the AWS Migration Portfolio Assessment tool, which is available to AWS Advanced Partners to help with migration business cases and TCO analysis.

The results were pretty clear: this customer would save a lot of money by going all in with AWS and doing a mass migration.

We took this information and pitched a full migration into AWS, explaining that while the key database might be more expensive on AWS than it was on the current hardware, the savings achievable by moving everything that was currently pinned to IaaS by the database were so compelling that a migration was the best option.

As long as the performance testing on the database instance on AWS worked out, we were on for a mass migration!

Performance testing MS SQL on AWS with customer data

We had a key metric from the physical database server that we were aiming to achieve on AWS - 6,000 transactions per second.

We started by benchmarking a like-for-like instance size compared to the physical machine using HammerDB, and we got up to 3,000 transactions per second with the database getting absolutely smashed at the same time.

That was a good start, but it wasn't realistic because it wasn't the same traffic pattern as the production database, and it wasn't actually using production data.

Next we got the production data into the new Landing Zone and imported it into the database. Now that we had real production data to work with, we needed to figure out how to generate traffic that mimicked how actual users were using the application.

We took a 24-hour profile of the queries being run on the production database and filtered it down to the top 100 most expensive queries to use in our performance test.
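SQL Server keeps cumulative query statistics in its DMVs, which is a handy starting point for building that kind of list. As a rough sketch only - the connection details are placeholders, and our actual 24-hour profile was built with more care than a single DMV snapshot:

    # Sketch: pull the most expensive queries from SQL Server's DMVs via pyodbc.
    # Connection details are placeholders; DMV stats are cumulative since the
    # last restart rather than a true 24-hour window.
    import pyodbc

    TOP_QUERIES_SQL = """
    SELECT TOP 100
        qs.total_worker_time / qs.execution_count AS avg_cpu_microseconds,
        qs.execution_count,
        st.text AS query_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_worker_time DESC;
    """

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=db.example.internal;DATABASE=AppDb;"   # placeholder host/database
        "UID=perf_reader;PWD=change-me"
    )
    for row in conn.cursor().execute(TOP_QUERIES_SQL):
        print(row.execution_count, (row.query_text or "")[:80])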

If we could get these 100 queries to execute at 6,000 transactions a second or more, we would be in good shape.

We wrote a script that executed one of the top 100 queries, slept for a random amount of time, and then executed another until it was told to stop. By deploying this to AWS Lambda at scale, we were able to simulate a large number of users who were only executing expensive queries.
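A simplified sketch of that kind of load generator is below - the connection string, bundled query list and timings are placeholders rather than what we actually ran:

    # Simplified sketch of the Lambda load generator. Assumes the ODBC driver
    # is packaged with the function (e.g. via a layer or container image) and
    # that the top-100 queries are bundled as JSON. All names are placeholders.
    import json
    import random
    import time

    import pyodbc

    with open("top_100_queries.json") as f:     # hypothetical bundled query list
        QUERIES = json.load(f)

    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=db.example.internal;DATABASE=AppDb;"   # placeholder host/database
        "UID=loadtest;PWD=change-me"
    )

    def handler(event, context):
        executed = 0
        deadline = time.time() + event.get("duration_seconds", 300)
        conn = pyodbc.connect(CONN_STR, autocommit=True)
        cursor = conn.cursor()

        while time.time() < deadline:
            cursor.execute(random.choice(QUERIES))   # run one expensive (read) query
            cursor.fetchall()
            executed += 1
            time.sleep(random.uniform(0.1, 2.0))     # random think time between queries

        return {"executed": executed}

Invoke a few thousand of these concurrently and you get a crude but effective simulation of real users running the database's worst queries.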

Many Lambdas.

With Lambda scaling out, we managed to get around 17,000 transactions per second - bearing in mind that this was with production data, and using the most expensive queries recorded from actual users on the production database.

Because these results were so good, we looked at where the bottleneck was on the database EC2 instance, right-typed and right-sized it, and re-ran the tests.

On a memory-optimised instance family, we were able to get 8,000 transactions per second on an instance half the size of the production physical machine.

When we presented the customer with these results, we got sign off to continue with the process of assessing and planning for a mass migration into AWS.

Next steps

Now that all the blockers for a mass migration into AWS have been overcome, it's time to work through the process of assessing the landscape in detail, planning the migration and finally executing it - all the while keeping the business running and using the core business applications around the clock.

Stay tuned for Part 2 of this process - the next write-up will be on the actual migration. How exactly do you migrate such a business-critical application into AWS? It's like fixing a plane at the same time as flying it!