Chaos Engineering with the AWS Fault Injection Simulator
Posted by Chris McKinnel - 23 November 2021 - 12 minute read
I was browsing the AWS RSS news feeds the other day and I happened to notice a service I'd never heard of called AWS Fault Injection Simulator. What? How had I not heard of this? I felt a tiny bit better when I saw it was released in March 2021, but I'm still eight months behind the 8-ball.
It turns out this actually happens to me quite frequently, which is terrifying considering I spend a good chunk of every day working with and talking about AWS to my colleagues and customers. How do people keep up with all the new services that get released? It's like they actually have time to read the AWS news... anyway, that's a topic for another day.
I had a quick look into this new service, and as expected it's basically Chaos Engineering as a Service - neat! I figured I'd better kick the tires of this thing and see how it could help my customers.
Chaos Engineering
But first, what exactly is Chaos Engineering? Wikipedia does a decent job of defining it for us:
Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.
So it's kind of like letting a gaggle of junior developers and engineers loose on your production environment: eventually one of them will accidentally turn off some stuff they weren't meant to. Except with this service, you get your very own automated gaggle of juniors clicking buttons and deleting things on-demand.
I should also say I've got nothing against juniors... they just break stuff every now and again. When I was a junior I truncated some tables in a production database and then went to lunch. I was a Chaos Engineer before that term had even been coined.
Getting set up for a Chaos test
Before we get stuck into the Fault Injection Simulator, we need some resources deployed that we can inject faults into. At the time of writing, FIS supports a handful of resource types: EC2, ECS, EKS and RDS.
So let's get an EC2 instance deployed and running a simple web server to see if we can inject a fault and break it. I decided to use Terraform to deploy my instances, if you want to follow along you can deploy them however you like.
First, create a providers.tf
provider "aws" {
region = "ap-southeast-2"
profile = var.aws_profile
}
This tells Terraform to use AWS as the provider.
Then, create a variables.tf
variable "aws_profile" {
type = string
description = "The name of your AWS profile - can be loaded from environment variables"
}
variable "public_key" {
type = string
description = "The SSH public key from the pair you'll use to connect to your instances - can be loaded from environment variables"
}
variable "vpc_id" {
type = string
description = "VPC to launch the EC2 instance in"
}
variable "public_subnet_a" {
type = string
description = "Subnet ID of public-subnet-a"
}
Let's define the values of these variables in terraform.tfvars
aws_profile="YOUR_PROFILE_NAME"
public_key="ssh-rsa YOUR_PUBLIC_KEY"
vpc_id="YOUR_VPC_ID"
public_subnet_a="YOUR_SUBNET_ID"
If you've got your local AWS CLI set up correctly, you should be able to get the name of your AWS profile from either ~/.aws/config or ~/.aws/credentials. If you haven't got your CLI set up, check out the AWS documentation.
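A quick sanity check that your profile actually works is to ask STS who you are (swap in your own profile name, of course):
aws sts get-caller-identity --profile YOUR_PROFILE_NAME
If that prints your account ID and ARN, you're good to go.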
If you've got SSH keys set up, you'll probably find your public key in ~/.ssh/id_rsa.pub or similar. If you haven't got local SSH keys set up, check out this helpful walkthrough.
Your AWS account should have some default VPCs and subnets that you can use for this demo. If you've deleted these and deployed your own, grab the ID of your VPC and one of your public subnets.
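If you're not sure what those IDs are, you can dig them out with the CLI - something like the following should do it (the VPC ID in the second command is a placeholder for whatever the first one returns):
aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query 'Vpcs[0].VpcId' --output text
aws ec2 describe-subnets --filters Name=vpc-id,Values=YOUR_VPC_ID --query 'Subnets[].[SubnetId,AvailabilityZone]' --output table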
And create bootstrap.sh
#!/bin/sh
# Install Apache, PHP and the MySQL client
yum -y install httpd php mysql php-mysql

# Enable and start httpd under either systemd or init
case $(ps -p 1 -o comm | tail -1) in
  systemd) systemctl enable --now httpd ;;
  init) chkconfig httpd on; service httpd start ;;
  *) echo "Error starting httpd (OS not using init or systemd)." >&2
esac

# Download and unpack the sample app if it isn't already there
if [ ! -f /var/www/html/bootcamp-app.tar ]; then
  cd /var/www/html
  wget https://s3.amazonaws.com/immersionday-labs/bootcamp-app.tar
  tar xvf bootcamp-app.tar
  chown apache:root /var/www/html/rds.conf.php
fi
yum -y update
This will install a simple Apache web server - I actually borrowed this script from the AWS EC2 Immersion Day labs.
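If the web server doesn't come up, it's worth checking that the user data actually ran - on Amazon Linux 2 the output of the bootstrap script ends up in the cloud-init log, so something like this (with your instance's public IP) will usually tell you what went wrong:
ssh ec2-user@YOUR_INSTANCE_IP
sudo tail -n 50 /var/log/cloud-init-output.log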
Finally we're ready to create main.tf
resource "aws_key_pair" "fis_test_instances" {
key_name = "fis-test-instances"
public_key = var.public_key
}
data "aws_ami" "ubuntu" {
most_recent = true
filter {
name = "name"
values = ["amzn2-ami-kernel-5.10-hvm-2.0.20211103.1-x86_64-gp2"]
}
owners = ["137112412989"]
}
data "http" "icanhazip" {
url = "http://icanhazip.com"
}
resource "aws_security_group" "test-instance-sg" {
name = "allow-all-from-my-ip"
description = "Allow all inbound traffic from my IP only"
vpc_id = var.vpc_id
ingress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = [
"${chomp(data.http.icanhazip.body)}/32"
]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_instance" "test" {
ami = data.aws_ami.ubuntu.id
instance_type = "t3.micro"
key_name = aws_key_pair.fis_test_instances.key_name
vpc_security_group_ids = ["${aws_security_group.test-instance-sg.id}"]
subnet_id = var.public_subnet_a
user_data = file("bootstrap.sh")
tags = {
Name = "webserver-1"
}
}
We're deploying a super simple EC2 instance, telling it to use bootstrap.sh as our user data, and setting up a security group to allow all traffic from our own IP address only.
We should now be able to run a terraform plan and a terraform apply to deploy our instance.
terraform plan
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# aws_instance.test will be created
+ resource "aws_instance" "test" {
+ ami = "ami-0c9f90931dd48d1f2"
+ arn = (known after apply)
+ associate_public_ip_address = (known after apply)
..............
+ tags_all = (known after apply)
+ vpc_id = "vpc-xxxxx"
}
Plan: 3 to add, 0 to change, 0 to destroy.
─────────────────────────────────────────────────────────────────────────────
Note: You didn't use the -out option to save this plan, so Terraform can't
guarantee to take exactly these actions if you run "terraform apply" now.
Great, looks like we can run terraform apply!
terraform apply -auto-approve
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# aws_instance.test will be created
+ resource "aws_instance" "test" {
+ ami = "ami-0c9f90931dd48d1f2"
+ arn = (known after apply)
+ associate_public_ip_address = (known after apply)
+ availability_zone = (known after apply)
.................
+ revoke_rules_on_delete = false
+ tags_all = (known after apply)
+ vpc_id = "vpc-xxxxxxx"
}
Plan: 3 to add, 0 to change, 0 to destroy.
aws_key_pair.fis_test_instances: Creating...
aws_security_group.test-instance-sg: Creating...
aws_key_pair.fis_test_instances: Creation complete after 0s [id=fis-test-instances]
aws_security_group.test-instance-sg: Creation complete after 2s [id=sg-0bfeexxxxxx]
aws_instance.test: Creating...
aws_instance.test: Still creating... [10s elapsed]
aws_instance.test: Creation complete after 14s [id=i-08df1xxxxxx]
Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
Sweet, let's check that our instance is running a web server. You should be able to hit the public IP address in your browser.
Note: if the public IP address doesn't work, try copying it into a private window and prefixing it with http:// instead of https://. Modern browsers rightly assume everything is encrypted these days, and the "open address" button in the AWS console prepends https:// to the URL as well.
You should see a running webserver in your browser, like below.
You would be surprised how many websites are running on a single EC2 instance, and this experiment will make it very clear that it's not a great way to run a website that you want people to actually visit and use.
With a configuration like this, if anything goes wrong on your instance or its plumbing, your website is dead.
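As an aside, if you'd rather grab the public IP from the CLI than the console, a query along these lines does the trick (filtering on the Name tag we set in Terraform):
aws ec2 describe-instances --filters "Name=tag:Name,Values=webserver-1" "Name=instance-state-name,Values=running" --query 'Reservations[].Instances[].PublicIpAddress' --output text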
Let's break stuff!
I got this far through the blog post and realised that Terraform doesn't support the Fault Injection Simulator yet... damn it! But, it does look like CloudFormation does, thankfully.
Let's define a FIS template and deploy it into our account to see if we can break our single instance.
Create faultinjection.yml
AWSTemplateFormatVersion: '2010-09-09'
Description: v1.0 FIS experiment template
Parameters:
  EC2InstanceName:
    Type: String
    ConstraintDescription: Name of the EC2 Instances
    Default: 'webserver-1'
Resources:
  ExperimentTemplate:
    Type: 'AWS::FIS::ExperimentTemplate'
    Properties:
      Actions:
        StopInstances:
          ActionId: 'aws:ec2:stop-instances'
          Targets:
            Instances: 'webservers'
      Description: 'stop ec2 instances'
      RoleArn: !GetAtt 'Role.Arn'
      Targets:
        webservers:
          ResourceTags:
            'Name': !Ref EC2InstanceName
          ResourceType: 'aws:ec2:instance'
          SelectionMode: 'ALL'
      StopConditions:
        - Source: 'none'
      Tags:
        Purpose: Testing FIS
  Role:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: 'fis.amazonaws.com'
            Action: 'sts:AssumeRole'
      Policies:
        - PolicyName: fis
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: AllowFISExperimentRoleEC2Actions
                Effect: Allow
                Action:
                  - 'ec2:StopInstances'
                  - 'ec2:StartInstances'
                Resource: !Sub 'arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/*'
In this template we pass in the name for our EC2 instances (in our case webserver-1) and define the FIS template and the IAM role that gives it permissions to start / stop our instance.
The documentation for this service is a little bit confusing, I must admit - it took me about 30 minutes to figure out how to get this CloudFormation template to do what I wanted. There is no example JSON and YAML in the documentation like there is for other CloudFormation resources, and some bits weren't really documented at all (e.g. Targets: Instances:), so it was a bit of trial and error.
aws cloudformation create-stack --stack-name fis-test-template --template-body file://faultinjection.yml --capabilities CAPABILITY_IAM
{
"StackId": "arn:aws:cloudformation:ap-southeast-2:000000000000:stack/fis-test-template/f2ce5c00-4c8e-11ec-9c5e-xxxxxxxx"
}
Let's check out the CloudFormation console to make sure our stack deployed correctly.
Now let's check the FIS console to check our template exists, and get ready to run an experiment!
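If you prefer the CLI to the console, listing the experiment templates should also show the one the stack just created:
aws fis list-experiment-templates --query 'experimentTemplates[].[id,description]' --output table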
We can drill into our experiment and check out the target details. For this one, we're targeting EC2 instances that have the Name tag webserver-1.
Let's make sure our website is still being served, but this time let's use the CLI to get rid of that pesky browser.
watch curl -I http://3.25.xx.xx/
This will check that our website is still being served every 2 seconds and tell us if the response code changes from 200 to something else.
Let's start the FIS experiment and see if our website goes down.
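I kicked mine off in the console, but you can also start it from the CLI using the template ID from the listing above (the ID below is a placeholder):
aws fis start-experiment --experiment-template-id EXTxxxxxxxxxxxxx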
After a few seconds, we can see that the curl starts failing!
That must mean that the experiment finished and our instance is stopped.
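We can confirm that from the CLI too - the instance state should now read stopped (or stopping):
aws ec2 describe-instances --filters "Name=tag:Name,Values=webserver-1" --query 'Reservations[].Instances[].State.Name' --output text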
Clearly our single instance doesn't cut it if we want to tolerate some level of failure in our EC2 instance infrastructure. There are a few things we can do to make it more resilient, and able to recover from failure.
Deploy another instance
One of the things we can do is deploy two instances instead of one, and sit a load balancer in front of it. This means we can send traffic to the load balancer and route traffic to a healthy instance if one becomes unavailable.
Let's make some changes to our Terraform to deploy an extra instance and put them both behind a load balancer.
Update the variables in variables.tf
variable "aws_profile" {
type = string
description = "The name of your AWS profile - can be loaded from environment variables"
}
variable "public_key" {
type = string
description = "The SSH public key from the pair you'll use to connect to your instances - can be loaded from environment variables"
}
variable "vpc_id" {
type = string
description = "VPC to launch the EC2 instance in"
}
variable "public_subnet_a" {
type = string
description = "Subnet ID of public-subnet-a"
}
variable "public_subnet_b" {
type = string
description = "Subnet ID of public-subnet-b"
}
Notice the extra subnet - if we're using an AWS Application Load Balancer, we need to provide it with subnets in at least two Availability Zones.
Add your extra subnet ID into terraform.tfvars
aws_profile="YOUR_PROFILE_NAME"
public_key="ssh-rsa YOUR_PUBLIC_KEY"
vpc_id="YOUR_VPC_ID"
public_subnet_a="YOUR_SUBNET_ID"
public_subnet_b="YOUR_SUBNET_ID"
Add the load balancer and extra instances to main.tf
resource "aws_key_pair" "fis_test_instances" {
key_name = "fis-test-instances"
public_key = var.public_key
}
data "aws_ami" "ubuntu" {
most_recent = true
filter {
name = "name"
values = ["amzn2-ami-kernel-5.10-hvm-2.0.20211103.1-x86_64-gp2"]
}
owners = ["137112412989"]
}
data "http" "icanhazip" {
url = "http://icanhazip.com"
}
resource "aws_security_group" "test-instance-sg" {
name = "allow-all-from-lb-sg"
description = "Allow all inbound traffic from my IP only"
vpc_id = var.vpc_id
ingress {
from_port = 0
to_port = 0
protocol = "-1"
security_groups = [aws_security_group.test-instance-lb.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_security_group" "test-instance-lb" {
name = "allow-all-from-my-ip"
description = "Allow all inbound traffic from my IP only"
vpc_id = var.vpc_id
ingress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = [
"${chomp(data.http.icanhazip.body)}/32"
]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_instance" "fis-test-instance-1" {
ami = data.aws_ami.ubuntu.id
instance_type = "t3.micro"
key_name = aws_key_pair.fis_test_instances.key_name
vpc_security_group_ids = ["${aws_security_group.test-instance-sg.id}"]
subnet_id = var.public_subnet_a
user_data = file("bootstrap.sh")
tags = {
Name = "webserver-1"
}
}
resource "aws_instance" "fis-test-instance-2" {
ami = data.aws_ami.ubuntu.id
instance_type = "t3.micro"
key_name = aws_key_pair.fis_test_instances.key_name
vpc_security_group_ids = ["${aws_security_group.test-instance-sg.id}"]
subnet_id = var.public_subnet_b
user_data = file("bootstrap.sh")
tags = {
Name = "webserver-2"
}
}
resource "aws_lb" "fis-test" {
name = "fis-test-lb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.test-instance-lb.id]
subnets = [var.public_subnet_a, var.public_subnet_b]
}
resource "aws_alb_listener" "alb_listener" {
load_balancer_arn = aws_lb.fis-test.arn
port = 80
protocol = "HTTP"
default_action {
target_group_arn = aws_lb_target_group.fis-test.arn
type = "forward"
}
}
resource "aws_lb_target_group" "fis-test" {
name = "fis-test-tg"
port = 80
protocol = "HTTP"
target_type = "instance"
vpc_id = var.vpc_id
health_check {
enabled = true
path = "/"
timeout = 10
}
}
resource "aws_lb_target_group_attachment" "fis-test-instance-1" {
target_group_arn = aws_lb_target_group.fis-test.arn
target_id = aws_instance.fis-test-instance-1.id
port = 80
}
resource "aws_lb_target_group_attachment" "fis-test-instance-2" {
target_group_arn = aws_lb_target_group.fis-test.arn
target_id = aws_instance.fis-test-instance-2.id
port = 80
}
We've defined a new instance, a load balancer, a target group and a listener, and we've attached both instances to the target group so the load balancer can route requests to them.
Let's apply these changes (without a plan, naughty!).
terraform apply -auto-approve
aws_key_pair.fis_test_instances: Refreshing state... [id=fis-test-instances]
aws_security_group.test-instance-sg: Refreshing state... [id=sg-081dbbf0xxxxxxb9]
aws_instance.fis-test-instance-1: Refreshing state... [id=i-084c06xxxxxxxa3]
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# aws_alb_listener.alb_listener will be created
+ resource "aws_alb_listener" "alb_listener" {
+ arn = (known after apply)
+ id = (known after apply)
+ load_balancer_arn = (known after apply)
...................
aws_lb.fis-test: Still creating... [1m0s elapsed]
aws_lb.fis-test: Creation complete after 1m3s [id=arn:aws:elasticloadbalancing:ap-southeast-2:0000000000:loadbalancer/app/fis-test-lb/2a316bxxxx8d25]
aws_alb_listener.alb_listener: Creating...
aws_alb_listener.alb_listener: Creation complete after 0s [id=arn:aws:elasticloadbalancing:ap-southeast-2:0000000000:listener/app/fis-test-lb/2a316xxxxx25/a40ddxxxxxfe8]
Apply complete! Resources: 6 added, 0 changed, 0 destroyed.
Now we should see a couple of instances and a load balancer that forwards requests on to our web servers. Traffic will be split between the two instances, and will only be routed to the healthy instances in the associated target group.
And when we hit the public DNS of the load balancer we should see our website being served successfully.
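Again, the CLI saves a trip to the console if you want to grab that DNS name - this pulls it straight from the load balancer we just created:
aws elbv2 describe-load-balancers --names fis-test-lb --query 'LoadBalancers[0].DNSName' --output text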
Great, let's try and re-run our Fault Injection Simulation! Remember, it's only targeting instances with the Name tag webserver-1, so if things go to plan we should see one instance stop, but the other continue to run and serve traffic successfully.
Now when we run our watch curl we shouldn't see any downtime, because the load balancer only routes traffic to healthy hosts.
No downtime! And we can see that one of our instances has been stopped as expected.
So what happens if our instance gets terminated instead of stopped? There's no coming back from that unless we have some mechanism in place that'll replace the instances that stop / get terminated.
Autoscaling groups
We can go one step further and deploy our EC2 instances using an AWS autoscaling group. In doing so, we can define a minimum number of healthy instances we want in the target group, which means AWS will automatically replace instances that get stopped or terminated.
Let's update our main.tf to include an autoscale group
resource "aws_key_pair" "fis_test_instances" {
key_name = "fis-test-instances"
public_key = var.public_key
}
data "http" "icanhazip" {
url = "http://icanhazip.com"
}
resource "aws_security_group" "test-instance-sg" {
name = "allow-all-from-lb-sg"
description = "Allow all inbound traffic from my IP only"
vpc_id = var.vpc_id
ingress {
from_port = 0
to_port = 0
protocol = "-1"
security_groups = [aws_security_group.test-instance-lb.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_security_group" "test-instance-lb" {
name = "allow-all-from-my-ip"
description = "Allow all inbound traffic from my IP only"
vpc_id = var.vpc_id
ingress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = [
"${chomp(data.http.icanhazip.body)}/32"
]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_lb" "fis-test" {
name = "fis-test-lb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.test-instance-lb.id]
subnets = [var.public_subnet_a, var.public_subnet_b]
}
resource "aws_alb_listener" "alb_listener" {
load_balancer_arn = aws_lb.fis-test.arn
port = 80
protocol = "HTTP"
default_action {
target_group_arn = aws_lb_target_group.fis-test.arn
type = "forward"
}
}
resource "aws_lb_target_group" "fis-test" {
name = "fis-test-tg"
port = 80
protocol = "HTTP"
target_type = "instance"
vpc_id = var.vpc_id
health_check {
enabled = true
path = "/"
timeout = 10
}
}
data "aws_ami" "amazon_linux" {
most_recent = true
filter {
name = "name"
values = ["amzn2-ami-kernel-5.10-hvm-2.0.20211103.1-x86_64-gp2"]
}
owners = ["137112412989"]
}
resource "aws_launch_configuration" "fis-test" {
name = "fis-test-lc"
image_id = data.aws_ami.amazon_linux.id
instance_type = "t3.micro"
key_name = aws_key_pair.fis_test_instances.key_name
security_groups = [aws_security_group.test-instance-sg.id]
user_data = file("bootstrap.sh")
}
resource "aws_autoscaling_group" "fis-test" {
name = "fis-test"
depends_on = [aws_launch_configuration.fis-test]
vpc_zone_identifier = [var.public_subnet_a, var.public_subnet_b]
max_size = 4
min_size = 2
health_check_grace_period = 60
health_check_type = "ELB"
desired_capacity = 2
force_delete = true
launch_configuration = aws_launch_configuration.fis-test.id
target_group_arns = [aws_lb_target_group.fis-test.arn]
tag {
key = "Name"
value = "webserver-1"
propagate_at_launch = true
}
}
Notice the removal of the standalone instances and the addition of a launch configuration and an autoscaling group. We hook the autoscaling group up to the target group and tell it to use the load balancer's health checks to decide when a new instance needs to be launched.
Let's deploy the latest code!
terraform apply -auto-approve
aws_lb_target_group_attachment.fis-test-instance-1: Refreshing state... [id=arn:aws:elasticloadbalancing:ap-southeast-2:046xxxxx2:targetgroup/fis-test-tg/d9554xxxc-2021112321xxx0000001]
aws_lb_target_group_attachment.fis-test-instance-2: Refreshing state... [id=arn:aws:elasticloadbalancing:ap-southeast-2:046xxxxxx2:targetgroup/fis-test-tg/dxxxb4f5a7be9c-2021112321492522800xx
aws_key_pair.fis_test_instances: Refreshing state... [id=fis-test-instances]
aws_instance.fis-test-instance-2: Refreshing state... [id=i-0d294766xxxxd]
aws_security_group.test-instance-lb: Refreshing state... [id=sg-0cb3bxxxxxfd0]
aws_lb_target_group.fis-test: Refreshing state... [id=arn:aws:elasticloadbalancing:ap-southeast-2:0xxxxxx8452:targetgroup/fis-test-tg/d9xxxbe9c]
aws_security_group.test-instance-sg: Refreshing state... [id=sg-081xxxxxx39b9]
aws_lb.fis-test: Refreshing state... [id=arn:aws:elasticloadbalancing:ap-southeast-2:046xxxxxxx2:loadbalancer/app/fis-test-lb/2axxxx98d25]
aws_alb_listener.alb_listener: Refreshing state... [id=arn:aws:elasticloadbalancing:ap-southeast-2:04xxxxx2:listener/app/fis-test-lb/2axxxxxx159f98d25/a4xxxxx3f7e68c0fe8]
Note: Objects have changed outside of Terraform
Terraform detected the following changes made outside of Terraform since the last "terraform apply":
# aws_instance.fis-test-instance-1 has been changed
~ resource "aws_instance" "fis-test-instance-1" {
~ associate_public_ip_address = true -> false
id = "i-084xxxxxxxa3"
................
aws_autoscaling_group.fis-test: Still creating... [1m0s elapsed]
aws_autoscaling_group.fis-test: Still creating... [1m10s elapsed]
aws_autoscaling_group.fis-test: Still creating... [1m20s elapsed]
aws_autoscaling_group.fis-test: Creation complete after 1m23s [id=fis-test]
Apply complete! Resources: 2 added, 0 changed, 4 destroyed.
Great, now we can check that our autoscaling group has been created successfully, and our instances are launched and connected to the target group correctly.
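A quick describe call is an easy way to double-check this without clicking around the console (the group name matches what we set in Terraform):
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names fis-test --query 'AutoScalingGroups[0].Instances[].[InstanceId,HealthStatus,LifecycleState]' --output table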
Notice the autoscale group tag config now names all of our instances webserver-1. Our FIS template, as it's written at the moment, will attempt to stop all of those instances as part of the experiment. Let's up the ante a little bit and change the template to terminate the instances instead of stopping them.
Update faultinjection.yml to terminate instead of stop
AWSTemplateFormatVersion: '2010-09-09'
Description: v1.0 FIS experiment template
Parameters:
  EC2InstanceName:
    Type: String
    ConstraintDescription: Name of the EC2 Instances
    Default: 'webserver-1'
Resources:
  ExperimentTemplate:
    Type: 'AWS::FIS::ExperimentTemplate'
    Properties:
      Actions:
        TerminateInstances:
          ActionId: 'aws:ec2:terminate-instances'
          Targets:
            Instances: 'webservers'
      Description: 'terminate ec2 instances'
      RoleArn: !GetAtt 'Role.Arn'
      Targets:
        webservers:
          ResourceTags:
            'Name': !Ref EC2InstanceName
          ResourceType: 'aws:ec2:instance'
          Filters:
            - Path: "State.Name"
              Values: ["running"]
          SelectionMode: 'ALL'
      StopConditions:
        - Source: 'none'
      Tags:
        Purpose: Testing FIS
  Role:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: 'fis.amazonaws.com'
            Action: 'sts:AssumeRole'
      Policies:
        - PolicyName: fis
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: AllowFISExperimentRoleEC2Actions
                Effect: Allow
                Action:
                  - 'ec2:StopInstances'
                  - 'ec2:StartInstances'
                  - 'ec2:TerminateInstances'
                Resource: !Sub 'arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/*'
I also needed to add a filter to the targets section so the experiment only matches running instances, because you can't target more than 5 instances at once as part of an experiment, and I had a few terminated instances sitting around that were also being matched. Fair enough - you probably don't want to accidentally terminate your whole fleet of 100 production instances!
aws cloudformation update-stack --stack-name fis-test-template --template-body file://faultinjection.yml --capabilities CAPABILITY_IAM
{
"StackId": "arn:aws:cloudformation:ap-southeast-2:000000000000:stack/fis-test-template/f2ce5c00-4c8e-11ec-9c5e-xxxxxxxx"
}
Let's start the FIS experiment again and see if our instances get terminated and auto-heal themselves.
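To watch the auto-healing in real time, you can poll the target group health in a second terminal - substitute your own target group ARN, which aws elbv2 describe-target-groups will give you:
watch aws elbv2 describe-target-health --target-group-arn YOUR_TARGET_GROUP_ARN --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' --output table
You should see the targets go unhealthy as the experiment terminates them, and new instance IDs appear as the autoscale group replaces them.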
Using the watch curl command we can see that even though all of the instances serving our application got terminated, we still only had around 2 minutes of downtime while the autoscale group provisioned some more instances. Not bad! But clearly not good enough for a production stack.
Some other things we might look at doing to strengthen our production stack:
- Launch more instances behind the load balancer
- Enable termination protection
- Add some better monitoring / alerting for observability
- Get better insights into what's happening on our production machines
Game Day Applications
I think this is a good service to utilise during game days and training events. Forcing your staff into building resilient architecture is a great way to prepare them for actual production outages.
There is a heap more that this service can do, and a heap more we can do to make our services resilient, but this post has ended up much longer than I expected it to! I might have to revisit this and explore other functionality another day.