How to ensure site availability with AWS Route 53 DNS failover
Hands up if you got caught out by the Great Fastly Outage of 2021?
✋
At Bared Footwear we use a number of Google Cloud products for our e-commerce site, integration software, and data warehousing. Having spent most of my career in AWS land, I had no idea Firebase Hosting used Fastly's edge cloud under the hood!
During the outage, we discussed a number of possible solutions. One of the advantages of embracing a Jamstack architecture — in our case, Shopify Plus, Gatsby, and Sanity — is the ability to shift your frontend build from one platform to another quickly and easily.
Given there was little in the way of information coming from Fastly, let alone an ETA, we made the decision to create an ad-hoc build locally and deploy it to AWS, which was unaffected by the outage.
Thankfully, Fastly came back up during the deployment process, so our site was back online and our customers were able to check out as normal. For some teams that may have been the end of the story: "We were only offline for an hour or so, we're back up now, and come on, it's Google, what are the chances of this happening again? We don't need to do anything else."
But not this team! #kaizen4life
Over the next day or so we set about implementing a way to automatically switch over to a secondary frontend should a similar outage ever happen again.
And now you can too with my simple 4-step guide!
Step 1: S3 and CloudFront
Our first step was to create an S3 bucket for hosting our frontend, and a CloudFront distribution that used the S3 bucket as its origin. For the most part, CloudFront works out of the box, but there were a few steps we needed to go through to get a static site playing nicely with S3 and CloudFront, such as:
Creating SSL certificates in AWS Certificate Manager so we can support TLS
Setting the default root object to index.html
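If you prefer to script this rather than click through the console, here is a minimal boto3 sketch of that setup. The bucket name, certificate ARN, and domain are placeholders, and it assumes the AWS managed CachingOptimized cache policy; treat it as a starting point rather than our exact configuration.

import time

import boto3

cloudfront = boto3.client("cloudfront")

BUCKET = "example-failover-site"  # placeholder bucket name
CERT_ARN = "arn:aws:acm:us-east-1:111111111111:certificate/example"  # placeholder; ACM cert must be in us-east-1
DOMAIN = "example.com"  # placeholder alternate domain name

response = cloudfront.create_distribution(
    DistributionConfig={
        "CallerReference": str(time.time()),
        "Comment": "Failover frontend served from S3",
        "Enabled": True,
        "DefaultRootObject": "index.html",  # serve index.html at the distribution root
        "Aliases": {"Quantity": 1, "Items": [DOMAIN]},
        "ViewerCertificate": {  # the ACM certificate that gives us TLS support
            "ACMCertificateArn": CERT_ARN,
            "SSLSupportMethod": "sni-only",
            "MinimumProtocolVersion": "TLSv1.2_2021",
        },
        "Origins": {
            "Quantity": 1,
            "Items": [{
                "Id": "s3-failover-origin",
                "DomainName": f"{BUCKET}.s3.amazonaws.com",
                "S3OriginConfig": {"OriginAccessIdentity": ""},
            }],
        },
        "DefaultCacheBehavior": {
            "TargetOriginId": "s3-failover-origin",
            "ViewerProtocolPolicy": "redirect-to-https",
            # ID of the AWS managed CachingOptimized policy
            "CachePolicyId": "658327ea-f89d-4fab-a63d-7e88639e58f6",
        },
    },
)
print(response["Distribution"]["DomainName"])  # e.g. dxxxxxxxxxxxx.cloudfront.net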
One gotcha we came across is that the default root object only applies to the root of the distribution, not to subfolders, so CloudFront returns a 403 error for those paths. Previously a Lambda@Edge function would have solved this; however, CloudFront recently introduced CloudFront Functions, which abstract away a lot of the pain involved in setting up a Lambda@Edge function.
The function below intercepts viewer requests and rewrites the URI before CloudFront forwards the request to the origin. For example, https://baredfootwear.com/collections/mens-sneakers is rewritten to https://baredfootwear.com/collections/mens-sneakers/index.html, which CloudFront can then successfully map to an object in S3 behind the scenes.
function handler(event) {
    var request = event.request;
    var uri = request.uri;

    // Requests ending in "/" get index.html appended
    if (uri.endsWith('/')) {
        request.uri += 'index.html';
    }
    // Requests without a file extension get "/index.html" appended
    else if (!uri.includes('.')) {
        request.uri += '/index.html';
    }

    return request;
}
Then it was just a matter of editing the default behaviour under Settings -> Behaviors -> Edit default, and associating the function we just created (which we called DefaultDocument) with the Viewer Request event.
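The same association can also be scripted. Here's a rough boto3 sketch, where the distribution ID and function ARN are placeholders:

import boto3

cloudfront = boto3.client("cloudfront")

DIST_ID = "E1EXAMPLE"  # placeholder distribution ID
FUNC_ARN = "arn:aws:cloudfront::111111111111:function/DefaultDocument"  # placeholder function ARN

# update_distribution needs the full current config plus its ETag
current = cloudfront.get_distribution_config(Id=DIST_ID)
config = current["DistributionConfig"]
config["DefaultCacheBehavior"]["FunctionAssociations"] = {
    "Quantity": 1,
    "Items": [{"EventType": "viewer-request", "FunctionARN": FUNC_ARN}],
}
cloudfront.update_distribution(Id=DIST_ID, DistributionConfig=config, IfMatch=current["ETag"])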
Step 2: Health check
After we confirmed that the S3/CloudFront setup was working as expected, we set about implementing some health checks to determine whether our Firebase-hosted site was working correctly. Normally this would involve setting up a single health check, but because Firebase recommends using multiple IP addresses for your A record (151.101.1.195 and 151.101.65.195), we had to create three health checks: two to check the individual IP addresses, and a calculated health check to evaluate the combined status of the IP checks.
First we created a health check for 151.101.1.195
Then we created a similar one for 151.101.65.195
The final step was to create the calculated health check: if both IP health checks are unhealthy, we want the calculated check to fail and send us an alert.
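For the scripting-inclined, here's a boto3 sketch of all three checks. The domain name is ours (swap in your own), and the request interval and failure threshold are assumptions rather than our exact settings.

import boto3

route53 = boto3.client("route53")

def create_ip_check(ip):
    # HTTPS health check against a single Firebase/Fastly IP;
    # the domain name is sent as the Host header
    return route53.create_health_check(
        CallerReference=f"firebase-{ip}",
        HealthCheckConfig={
            "IPAddress": ip,
            "Port": 443,
            "Type": "HTTPS",
            "ResourcePath": "/",
            "FullyQualifiedDomainName": "baredfootwear.com",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )["HealthCheck"]["Id"]

check_a = create_ip_check("151.101.1.195")
check_b = create_ip_check("151.101.65.195")

# The calculated check stays healthy while at least one IP check passes,
# and fails only when both IPs are unhealthy
calculated_id = route53.create_health_check(
    CallerReference="firebase-calculated",
    HealthCheckConfig={
        "Type": "CALCULATED",
        "ChildHealthChecks": [check_a, check_b],
        "HealthThreshold": 1,
        "Inverted": False,
    },
)["HealthCheck"]["Id"]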
TIP: "Invert health check status" is a great way to test if your health checks are working as expected. Just check the box to simulate a failure and voila! you're able to test your failover without having to actually nuke your site.
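If you'd rather script the test, the same toggle is available through the API. A tiny sketch with a placeholder health check ID:

import boto3

route53 = boto3.client("route53")

# Flip the calculated check to simulate a Firebase outage...
route53.update_health_check(HealthCheckId="abcd1234-placeholder", Inverted=True)
# ...watch the DNS fail over, then restore normal behaviour
route53.update_health_check(HealthCheckId="abcd1234-placeholder", Inverted=False)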
Step 3: DNS
The next step was to update our DNS in Route 53. First, we updated our A record, changing its routing policy from Simple to Failover and marking it as the Primary record, with the calculated health check from Step 2 attached.
Next we added a Secondary Failover A record pointing to our CloudFront distribution.
Our Route 53 hosted zone now looks something like this.
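If you want to make the same change with the SDK, here's a sketch of the equivalent change_resource_record_sets call. The hosted zone ID, health check ID, and distribution domain are placeholders; Z2FDTNDATAQYW2 is the fixed hosted zone ID AWS uses for all CloudFront aliases.

import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0EXAMPLE",  # placeholder hosted zone ID
    ChangeBatch={"Changes": [
        {
            # Primary record: the Firebase IPs, gated by the calculated health check
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "baredfootwear.com",
                "Type": "A",
                "SetIdentifier": "primary-firebase",
                "Failover": "PRIMARY",
                "HealthCheckId": "abcd1234-placeholder",  # calculated check from Step 2
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "151.101.1.195"},
                    {"Value": "151.101.65.195"},
                ],
            },
        },
        {
            # Secondary record: alias to the CloudFront distribution
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "baredfootwear.com",
                "Type": "A",
                "SetIdentifier": "secondary-cloudfront",
                "Failover": "SECONDARY",
                "AliasTarget": {
                    "HostedZoneId": "Z2FDTNDATAQYW2",  # fixed zone ID for CloudFront aliases
                    "DNSName": "dxxxxxxxxxxxx.cloudfront.net",  # placeholder distribution domain
                    "EvaluateTargetHealth": False,
                },
            },
        },
    ]},
)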
Step 4: Gatsby Cloud
We use Gatsby Cloud as our build tool of choice because it offers incremental builds, fast build and deployment times, and integrated Lighthouse reports. At the moment Gatsby Cloud doesn't support multiple deployments within a single build profile (but we asked for it!), so we cloned our production build profile and altered the deployment target from Firebase to S3.
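Gatsby Cloud handles the S3 upload for us, but if you ever need to push a build to the failover bucket by hand (as in our ad-hoc deploy during the outage), a bare-bones sketch with a placeholder bucket name might look like this:

import mimetypes
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "example-failover-site"  # placeholder bucket name
BUILD_DIR = Path("public")        # Gatsby's default build output directory

# Upload every file from the build output, preserving paths and content types
for path in BUILD_DIR.rglob("*"):
    if path.is_file():
        key = path.relative_to(BUILD_DIR).as_posix()
        content_type = mimetypes.guess_type(path.name)[0] or "binary/octet-stream"
        s3.upload_file(str(path), BUCKET, key, ExtraArgs={"ContentType": content_type})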
Now we have concurrent production and failover deployments whenever any changes are made to our master branch.
We ran a couple of witching-hour tests by inverting the health check status, and we observed the DNS switching over to our CloudFront distribution within two to three minutes.
Conclusion
Despite the perceived uptime benefits of using Jamstack, there will always be vendor outages. DNS failover is a really simple and effective way to ensure your site's availability and keep your customers happy.
Need to make your system more resilient? Get in touch to find out how I can help