AWS Yarns

A blog about AWS things.

Chaos Engineering with the AWS Fault Injection Simulator

Posted by Chris McKinnel - 23 November 2021
12 minute read

I was browsing the AWS RSS news feeds the other day and I happened to notice a service I'd never heard of called AWS Fault Injection Simulator. What? How had I not heard of this? I felt a tiny bit better when I saw it was released in March 2021, but I'm still eight months behind the 8-ball.

It turns out this actually happens to me quite frequently, which is terrifying considering I spend a good chunk of every day working with and talking about AWS to my colleagues and customers. How do people keep up with all the new services that get released? It's like they actually have time to read the AWS news... anyway, that's a topic for another day.

I had a quick look into this new service, and as expected it's basically Chaos Engineering as a Service - neat! I figured I'd better kick the tires of this thing and see how it could help my customers.

Chaos Engineering

Graphic of chaos monkey.

But first, what exactly is Chaos Engineering? Wikipedia does a decent job of defining it for us:

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.

So it's kind of like letting a gaggle of junior developers and engineers loose on your production environment: eventually one of them will accidentally turn off something they weren't meant to. Except with this service, you get your very own automated gaggle of juniors clicking buttons and deleting things on demand.

I should also say I've got nothing against juniors... they just break stuff every now and again. When I was a junior I truncated some tables in a production database and then went to lunch. I was a Chaos Engineer before that term had even been coined.

Getting set up for a Chaos test

Before we get stuck into the Fault Injection Simulator, we need some resources deployed that we can inject faults into. At the time of writing, FIS supports the following resource types:

  • EC2 instances (stop, reboot, terminate, Spot interruptions)
  • ECS clusters
  • EKS node groups
  • RDS instances and Aurora clusters
  • SSM-based in-guest actions (CPU stress and the like)

So let's get an EC2 instance deployed and running a simple web server to see if we can inject a fault and break it. I decided to use Terraform to deploy my instances; if you want to follow along, you can deploy them however you like.

First, create a providers.tf

provider "aws" {
  region  = "ap-southeast-2"
  profile = var.aws_profile
}

This tells Terraform to use AWS as the provider.

Then, create a variables.tf

variable "aws_profile" {
  type        = string
  description = "The name of your AWS profile - can be loaded from environment variables"
}

variable "public_key" {
  type        = string
  description = "The SSH public key from the pair you'll use to connect to your instances - can be loaded from environment variables"
}

variable "vpc_id" {
  type        = string
  description = "VPC to launch the EC2 instance in"
}

variable "public_subnet_a" {
  type        = string
  description = "Subnet ID of public-subnet-a"
}

Let's define the values of these variables in terraform.tfvars

aws_profile     = "YOUR_PROFILE_NAME"
public_key      = "ssh-rsa YOUR_PUBLIC_KEY"
vpc_id          = "YOUR_VPC_ID"
public_subnet_a = "YOUR_SUBNET_ID"

If you've got your local AWS CLI set up correctly, you should be able to get the name of your AWS profile from either ~/.aws/config or ~/.aws/credentials. If you haven't got your CLI set up, check out the AWS documentation.

If you've got SSH keys set up, you'll probably find your public key in ~/.ssh/id_rsa.pub or similar. If you haven't got local SSH keys setup, check out this helpful walkthrough.

Your AWS account should have some default VPCs and subnets that you can use for this demo. If you've deleted these and deployed your own, grab the ID of your VPC and one of your public subnets.
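
If you're on the default VPC, the AWS CLI can dig these IDs out for you. A quick sketch (profile and region flags omitted; add them if you need to):

```shell
# Find the default VPC in the current region
aws ec2 describe-vpcs \
  --filters "Name=isDefault,Values=true" \
  --query "Vpcs[0].VpcId" --output text

# List the subnets in that VPC (substitute the VPC ID from above)
aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=vpc-xxxxxxxx" \
  --query "Subnets[].{id:SubnetId,az:AvailabilityZone}" --output table
```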

And create bootstrap.sh

#!/bin/sh
yum -y install httpd php mysql php-mysql

# Start httpd under either systemd or classic SysV init
case $(ps -p 1 -o comm | tail -1) in
  systemd) systemctl enable --now httpd ;;
  init) chkconfig httpd on; service httpd start ;;
  *) echo "Error starting httpd (OS not using init or systemd)." >&2 ;;
esac

# Fetch and unpack the sample app on first boot only
if [ ! -f /var/www/html/bootcamp-app.tar ]; then
  cd /var/www/html
  wget https://s3.amazonaws.com/immersionday-labs/bootcamp-app.tar
  tar xvf bootcamp-app.tar
  chown apache:root /var/www/html/rds.conf.php
fi
yum -y update

This will install a simple Apache web server and a sample PHP app - I borrowed it from the AWS EC2 Immersion Day labs.

Finally we're ready to create main.tf

resource "aws_key_pair" "fis_test_instances" {
  key_name   = "fis-test-instances"
  public_key = var.public_key
}

data "aws_ami" "amazon_linux" {
  most_recent = true

  filter {
    name   = "name"
    values = ["amzn2-ami-kernel-5.10-hvm-2.0.20211103.1-x86_64-gp2"]
  }

  owners = ["137112412989"]
}

data "http" "icanhazip" {
  url = "http://icanhazip.com"
}

resource "aws_security_group" "test-instance-sg" {
  name        = "allow-all-from-my-ip"
  description = "Allow all inbound traffic from my IP only"
  vpc_id      = var.vpc_id

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    cidr_blocks = [
      "${chomp(data.http.icanhazip.body)}/32"
    ]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "test" {
  ami                    = data.aws_ami.amazon_linux.id
  instance_type          = "t3.micro"
  key_name               = aws_key_pair.fis_test_instances.key_name
  vpc_security_group_ids = [aws_security_group.test-instance-sg.id]
  subnet_id              = var.public_subnet_a
  user_data              = file("bootstrap.sh")

  tags = {
    Name = "webserver-1"
  }
}

We're deploying a super simple EC2 instance, telling it to use bootstrap.sh as our user data, and setting up a security group to allow all traffic from our own IP address only.

We should now be able to run a terraform plan and a terraform apply to deploy our instance.

terraform plan

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_instance.test will be created
  + resource "aws_instance" "test" {
      + ami                                  = "ami-0c9f90931dd48d1f2"
      + arn                                  = (known after apply)
      + associate_public_ip_address          = (known after apply)

 ..............
 
 
       + tags_all               = (known after apply)
      + vpc_id                 = "vpc-xxxxx"
    }

Plan: 3 to add, 0 to change, 0 to destroy.

─────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't
guarantee to take exactly these actions if you run "terraform apply" now.

Great, looks like we can run terraform apply!

terraform apply -auto-approve

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_instance.test will be created
  + resource "aws_instance" "test" {
      + ami                                  = "ami-0c9f90931dd48d1f2"
      + arn                                  = (known after apply)
      + associate_public_ip_address          = (known after apply)
      + availability_zone                    = (known after apply)
      
.................
      

       + revoke_rules_on_delete = false
      + tags_all               = (known after apply)
      + vpc_id                 = "vpc-xxxxxxx"
    }

Plan: 3 to add, 0 to change, 0 to destroy.
aws_key_pair.fis_test_instances: Creating...
aws_security_group.test-instance-sg: Creating...
aws_key_pair.fis_test_instances: Creation complete after 0s [id=fis-test-instances]
aws_security_group.test-instance-sg: Creation complete after 2s [id=sg-0bfeexxxxxx]
aws_instance.test: Creating...
aws_instance.test: Still creating... [10s elapsed]
aws_instance.test: Creation complete after 14s [id=i-08df1xxxxxx]

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

Sweet, let's check that our instance is running a web server. You should be able to hit the public IP address in your browser.
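
If you'd rather not go console-spelunking for the IP, the AWS CLI can fetch it using the Name tag our Terraform applied:

```shell
# Public IP of the running instance tagged Name=webserver-1
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=webserver-1" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" --output text
```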

Screen shot of public IP address of new instance.

Note: if the public IP address doesn't work, try copying it into a private window and prefixing it with http:// instead of https://. Modern browsers assume everything is encrypted these days, and the "open address" button in the AWS console also uses HTTPS, but our demo instance only serves plain HTTP.

You should see a running webserver in your browser, like below.

Screen shot of webserver running on new instance.

You would be surprised how many websites run on a single EC2 instance, and this experiment will make it very clear why that's not a great way to run a website you want people to actually visit and use.

With a configuration like this, if anything goes wrong on your instance or its plumbing, your website is dead.

Let's break stuff!

I got this far through the blog post and realised that Terraform doesn't support the Fault Injection Simulator yet... damn it! But it does look like CloudFormation does, thankfully.

Let's define a FIS template and deploy it into our account to see if we can break our single instance.

Create faultinjection.yml

AWSTemplateFormatVersion: '2010-09-09'
Description: v1.0 FIS experiment template
Parameters:
  EC2InstanceName:
    Type: String
    ConstraintDescription: Name of the EC2 Instances
    Default: 'webserver-1'
Resources:
  ExperimentTemplate:
    Type: 'AWS::FIS::ExperimentTemplate'
    Properties:
      Actions:
        StopInstances:
          ActionId: 'aws:ec2:stop-instances'
          Targets:
            Instances: 'webservers'
      Description: 'terminate ec2 instances'
      RoleArn: !GetAtt 'Role.Arn'
      Targets:
        webservers:
          ResourceTags:
            'Name': !Ref EC2InstanceName
          ResourceType: 'aws:ec2:instance'
          SelectionMode: 'ALL'
      StopConditions:
      - Source: 'none'
      Tags:
        Purpose: Testing FIS

  Role:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service: 'fis.amazonaws.com'
          Action: 'sts:AssumeRole'
      Policies:
      - PolicyName: fis
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
          - Sid: AllowFISExperimentRoleEC2Actions
            Effect: Allow
            Action:
            - 'ec2:StopInstances'
            - 'ec2:StartInstances'
            Resource: !Sub 'arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/*'

In this template we pass in the name for our EC2 instances (in our case webserver-1) and define the FIS template and IAM role that gives it permissions to start / stop our instance.

The documentation for this service is a little confusing - I must admit it took me about 30 minutes to figure out how to get this CloudFormation template to do what I wanted. There are no example JSON or YAML snippets like there are for other CloudFormation resources, and some bits weren't really documented at all (e.g. Targets: Instances:), so it was a bit of trial and error.

aws cloudformation create-stack --stack-name fis-test-template --template-body file://faultinjection.yml --capabilities CAPABILITY_IAM
{
    "StackId": "arn:aws:cloudformation:ap-southeast-2:000000000000:stack/fis-test-template/f2ce5c00-4c8e-11ec-9c5e-xxxxxxxx"
}

Let's check out the CloudFormation console to make sure our stack deployed correctly.

Screen shot of FIS CloudFormation stack.
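
Or, if you'd rather stay in the terminal, the stack status tells the same story:

```shell
# Should print CREATE_COMPLETE once the stack has finished deploying
aws cloudformation describe-stacks \
  --stack-name fis-test-template \
  --query "Stacks[0].StackStatus" --output text
```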

Now let's check the FIS console to check our template exists, and get ready to run an experiment!

Screen shot of FIS dashboard.
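
The CLI can confirm the template exists, too:

```shell
# List FIS experiment templates in the region
aws fis list-experiment-templates \
  --query "experimentTemplates[].{id:id,description:description}" --output table
```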

We can drill into our experiment and check out the target details. For this one, we're targeting EC2 instances that have the name tag webserver-1.

Screen shot of FIS experiment.

Let's make sure our website is still being served, but this time let's use the CLI to get rid of that pesky browser.

watch curl -I http://3.25.xx.xx/

Screen shot of watching curl.

This will check that our website is still being served every 2 seconds (watch's default interval) and show us if the response code changes from 200 to something else.

Let's start the FIS experiment and see if our website goes down.

Screen shot of starting a FIS experiment.
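
You can kick the experiment off from the CLI as well - the template ID below is a placeholder, so swap in the one from your account:

```shell
# Start an experiment from an existing template
aws fis start-experiment --experiment-template-id EXTxxxxxxxxxxxx
```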

After a few seconds, we can see that the curl starts failing!

Animation of watching curl fail.

That must mean that the experiment finished and our instance is stopped.

Screen shot of webserver stopped.

Screen shot of completed experiment.

Clearly our single instance doesn't cut it if we want to tolerate some level of failure in our EC2 instance infrastructure. There are a few things we can do to make it more resilient, and able to recover from failure.

Deploy another instance

One of the things we can do is deploy two instances instead of one and sit a load balancer in front of them. We send traffic to the load balancer, and it routes requests to a healthy instance if the other becomes unavailable.

Let's make some changes to our Terraform to deploy an extra instance and put them both behind a load balancer.

Update the variables in variables.tf

variable "aws_profile" {
  type        = string
  description = "The name of your AWS profile - can be loaded from environment variables"
}

variable "public_key" {
  type        = string
  description = "The SSH public key from the pair you'll use to connect to your instances - can be loaded from environment variables"
}

variable "vpc_id" {
  type        = string
  description = "VPC to launch the EC2 instance in"
}

variable "public_subnet_a" {
  type        = string
  description = "Subnet ID of public-subnet-a"
}

variable "public_subnet_b" {
  type        = string
  description = "Subnet ID of public-subnet-b"
}

Notice the extra subnet - an AWS Application Load Balancer must be given subnets in at least two Availability Zones.

Add your extra subnet ID into terraform.tfvars

aws_profile     = "YOUR_PROFILE_NAME"
public_key      = "ssh-rsa YOUR_PUBLIC_KEY"
vpc_id          = "YOUR_VPC_ID"
public_subnet_a = "YOUR_SUBNET_ID"
public_subnet_b = "YOUR_SUBNET_ID"

Add the load balancer and extra instances to main.tf

resource "aws_key_pair" "fis_test_instances" {
  key_name   = "fis-test-instances"
  public_key = var.public_key
}

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["amzn2-ami-kernel-5.10-hvm-2.0.20211103.1-x86_64-gp2"]
  }

  owners = ["137112412989"]
}

data "http" "icanhazip" {
  url = "http://icanhazip.com"
}

resource "aws_security_group" "test-instance-sg" {
  name        = "allow-all-from-lb-sg"
  description = "Allow all inbound traffic from the load balancer only"
  vpc_id      = var.vpc_id

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    security_groups = [aws_security_group.test-instance-lb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "test-instance-lb" {
  name        = "allow-all-from-my-ip"
  description = "Allow all inbound traffic from my IP only"
  vpc_id      = var.vpc_id

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    cidr_blocks = [
      "${chomp(data.http.icanhazip.body)}/32"
    ]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "fis-test-instance-1" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = "t3.micro"
  key_name               = aws_key_pair.fis_test_instances.key_name
  vpc_security_group_ids = [aws_security_group.test-instance-sg.id]
  subnet_id              = var.public_subnet_a
  user_data              = file("bootstrap.sh")

  tags = {
    Name = "webserver-1"
  }
}

resource "aws_instance" "fis-test-instance-2" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = "t3.micro"
  key_name               = aws_key_pair.fis_test_instances.key_name
  vpc_security_group_ids = [aws_security_group.test-instance-sg.id]
  subnet_id              = var.public_subnet_b
  user_data              = file("bootstrap.sh")

  tags = {
    Name = "webserver-2"
  }
}

resource "aws_lb" "fis-test" {
  name               = "fis-test-lb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.test-instance-lb.id]
  subnets            = [var.public_subnet_a, var.public_subnet_b]
}

resource "aws_alb_listener" "alb_listener" {  
  load_balancer_arn = aws_lb.fis-test.arn
  port              = 80
  protocol          = "HTTP"
  
  default_action {    
    target_group_arn = aws_lb_target_group.fis-test.arn
    type             = "forward"
  }
}

resource "aws_lb_target_group" "fis-test" {
  name     = "fis-test-tg"
  port     = 80
  protocol = "HTTP"
  target_type = "instance"
  vpc_id   = var.vpc_id

  health_check {
    enabled = true
    path    = "/"
    timeout = 10
  }
}

resource "aws_lb_target_group_attachment" "fis-test-instance-1" {
  target_group_arn = aws_lb_target_group.fis-test.arn
  target_id        = aws_instance.fis-test-instance-1.id
  port             = 80
}

resource "aws_lb_target_group_attachment" "fis-test-instance-2" {
  target_group_arn = aws_lb_target_group.fis-test.arn
  target_id        = aws_instance.fis-test-instance-2.id
  port             = 80
}

We've defined a new instance, a load balancer, a target group, and a listener, and assigned both instances to the target group so the load balancer can route requests to them.

Let's apply these changes (without a plan, naughty!).

terraform apply -auto-approve
aws_key_pair.fis_test_instances: Refreshing state... [id=fis-test-instances]
aws_security_group.test-instance-sg: Refreshing state... [id=sg-081dbbf0xxxxxxb9]
aws_instance.fis-test-instance-1: Refreshing state... [id=i-084c06xxxxxxxa3]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_alb_listener.alb_listener will be created
  + resource "aws_alb_listener" "alb_listener" {
      + arn               = (known after apply)
      + id                = (known after apply)
      + load_balancer_arn = (known after apply)
      
...................

aws_lb.fis-test: Still creating... [1m0s elapsed]
aws_lb.fis-test: Creation complete after 1m3s [id=arn:aws:elasticloadbalancing:ap-southeast-2:0000000000:loadbalancer/app/fis-test-lb/2a316bxxxx8d25]
aws_alb_listener.alb_listener: Creating...
aws_alb_listener.alb_listener: Creation complete after 0s [id=arn:aws:elasticloadbalancing:ap-southeast-2:0000000000:listener/app/fis-test-lb/2a316xxxxx25/a40ddxxxxxfe8]

Apply complete! Resources: 6 added, 0 changed, 0 destroyed.

Now we should see a couple of instances and a load balancer that forwards requests on to our web servers. Traffic will be split between the two instances and routed only to healthy members of the associated target group.

Screen shot of the instances.

Screen shot of the load balancer.

Screen shot of the target group.

And when we hit the public DNS of the load balancer we should see our website being served successfully.

Screen shot of the website via the load balancer.

Great, let's re-run our Fault Injection Simulator experiment! Remember, it's only targeting instances with the Name tag webserver-1, so if things go to plan we should see one instance stop while the other continues to run and serve traffic.

Screen shot of starting a FIS experiment.

Now when we run our watch curl we shouldn't see any downtime because the load balancer only routes traffic to healthy hosts.

Animated gif of watching curl hit the load balancer.

No downtime! And we can see that one of our instances has been stopped as expected.

Screen shot of stopped instance in target group.

So what happens if our instance gets terminated instead of stopped? There's no coming back from that unless we have some mechanism in place to replace instances that stop or get terminated.

Autoscaling groups

We can go one step further and deploy our EC2 instances in an AWS Auto Scaling group. That way we can define a minimum number of healthy instances we want in the target group, and AWS will automatically replace instances that get stopped or terminated.

Let's update our main.tf to include an autoscale group

resource "aws_key_pair" "fis_test_instances" {
  key_name   = "fis-test-instances"
  public_key = var.public_key
}

data "http" "icanhazip" {
  url = "http://icanhazip.com"
}

resource "aws_security_group" "test-instance-sg" {
  name        = "allow-all-from-lb-sg"
  description = "Allow all inbound traffic from the load balancer only"
  vpc_id      = var.vpc_id

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    security_groups = [aws_security_group.test-instance-lb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "test-instance-lb" {
  name        = "allow-all-from-my-ip"
  description = "Allow all inbound traffic from my IP only"
  vpc_id      = var.vpc_id

  ingress {
    from_port = 0
    to_port   = 0
    protocol  = "-1"
    cidr_blocks = [
      "${chomp(data.http.icanhazip.body)}/32"
    ]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb" "fis-test" {
  name               = "fis-test-lb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.test-instance-lb.id]
  subnets            = [var.public_subnet_a, var.public_subnet_b]
}

resource "aws_alb_listener" "alb_listener" {  
  load_balancer_arn = aws_lb.fis-test.arn
  port              = 80
  protocol          = "HTTP"
  
  default_action {    
    target_group_arn = aws_lb_target_group.fis-test.arn
    type             = "forward"
  }
}

resource "aws_lb_target_group" "fis-test" {
  name     = "fis-test-tg"
  port     = 80
  protocol = "HTTP"
  target_type = "instance"
  vpc_id   = var.vpc_id

  health_check {
    enabled = true
    path    = "/"
    timeout = 10
  }
}

data "aws_ami" "amazon_linux" {
  most_recent = true

  filter {
    name   = "name"
    values = ["amzn2-ami-kernel-5.10-hvm-2.0.20211103.1-x86_64-gp2"]
  }

  owners = ["137112412989"]
}

resource "aws_launch_configuration" "fis-test" {
  name            = "fis-test-lc"
  image_id        = data.aws_ami.amazon_linux.id
  instance_type   = "t3.micro"
  key_name        = aws_key_pair.fis_test_instances.key_name
  security_groups = [aws_security_group.test-instance-sg.id]
  user_data       = file("bootstrap.sh")
}

resource "aws_autoscaling_group" "fis-test" {
  name                      = "fis-test"
  depends_on                = [aws_launch_configuration.fis-test]
  vpc_zone_identifier       = [var.public_subnet_a, var.public_subnet_b]
  max_size                  = 4
  min_size                  = 2
  health_check_grace_period = 60
  health_check_type         = "ELB"
  desired_capacity          = 2
  force_delete              = true
  launch_configuration      = aws_launch_configuration.fis-test.id
  target_group_arns         = [aws_lb_target_group.fis-test.arn]
  tag {
    key                 = "Name"
    value               = "webserver-1"
    propagate_at_launch = true
  }
}

Notice the removal of the standalone instances and the addition of a launch configuration and an Auto Scaling group. We hook the Auto Scaling group up to the target group and tell it to use the load balancer's health checks to decide when a new instance needs to be launched.

Let's deploy the latest code!

terraform apply -auto-approve
aws_lb_target_group_attachment.fis-test-instance-1: Refreshing state...  [id=arn:aws:elasticloadbalancing:ap-southeast-2:046xxxxx2:targetgroup/fis-test-tg/d9554xxxc-2021112321xxx0000001]            
aws_lb_target_group_attachment.fis-test-instance-2: Refreshing state...  [id=arn:aws:elasticloadbalancing:ap-southeast-2:046xxxxxx2:targetgroup/fis-test-tg/dxxxb4f5a7be9c-2021112321492522800xx
aws_key_pair.fis_test_instances: Refreshing state... [id=fis-test-instances]
aws_instance.fis-test-instance-2: Refreshing state... [id=i-0d294766xxxxd]
aws_security_group.test-instance-lb: Refreshing state... [id=sg-0cb3bxxxxxfd0]
aws_lb_target_group.fis-test: Refreshing state... [id=arn:aws:elasticloadbalancing:ap-southeast-2:0xxxxxx8452:targetgroup/fis-test-tg/d9xxxbe9c]
aws_security_group.test-instance-sg: Refreshing state... [id=sg-081xxxxxx39b9]
aws_lb.fis-test: Refreshing state... [id=arn:aws:elasticloadbalancing:ap-southeast-2:046xxxxxxx2:loadbalancer/app/fis-test-lb/2axxxx98d25]
aws_alb_listener.alb_listener: Refreshing state... [id=arn:aws:elasticloadbalancing:ap-southeast-2:04xxxxx2:listener/app/fis-test-lb/2axxxxxx159f98d25/a4xxxxx3f7e68c0fe8]

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply":

  # aws_instance.fis-test-instance-1 has been changed
  ~ resource "aws_instance" "fis-test-instance-1" {
      ~ associate_public_ip_address          = true -> false
        id                                   = "i-084xxxxxxxa3"
        
        
................


aws_autoscaling_group.fis-test: Still creating... [1m0s elapsed]
aws_autoscaling_group.fis-test: Still creating... [1m10s elapsed]
aws_autoscaling_group.fis-test: Still creating... [1m20s elapsed]
aws_autoscaling_group.fis-test: Creation complete after 1m23s [id=fis-test]

Apply complete! Resources: 2 added, 0 changed, 4 destroyed.

Great, now we can check that our autoscaling group has been created successfully, and our instances are launched and connected to the target group correctly.

Screen shot of autoscale group.

Screen shot of launch config.

Screen shot of instances created by autoscale group.

Screen shot of autoscale group associated to target group.

Notice the autoscale group tag config now names all of our instances webserver-1, so our FIS template as written will attempt to stop every one of them as part of the experiment. Let's up the ante a little and change the template to terminate the instances instead of stopping them.

Update faultinjection.yml to terminate instead of stop

AWSTemplateFormatVersion: '2010-09-09'
Description: v1.0 FIS experiment template
Parameters:
  EC2InstanceName:
    Type: String
    ConstraintDescription: Name of the EC2 Instances
    Default: 'webserver-1'
Resources:
  ExperimentTemplate:
    Type: 'AWS::FIS::ExperimentTemplate'
    Properties:
      Actions:
        TerminateInstances:
          ActionId: 'aws:ec2:terminate-instances'
          Targets:
            Instances: 'webservers'
      Description: 'terminate ec2 instances'
      RoleArn: !GetAtt 'Role.Arn'
      Targets:
        webservers:
          ResourceTags:
            'Name': !Ref EC2InstanceName
          ResourceType: 'aws:ec2:instance'
          Filters:
          - Path: "State.Name"
            Values: ["running"]
          SelectionMode: 'ALL'
      StopConditions:
      - Source: 'none'
      Tags:
        Purpose: Testing FIS

  Role:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service: 'fis.amazonaws.com'
          Action: 'sts:AssumeRole'
      Policies:
      - PolicyName: fis
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
          - Sid: AllowFISExperimentRoleEC2Actions
            Effect: Allow
            Action:
            - 'ec2:StopInstances'
            - 'ec2:StartInstances'
            - 'ec2:TerminateInstances'
            Resource: !Sub 'arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/*'

I also needed to add a filter to the targets section so that only running instances are targeted - an experiment can't target more than 5 instances at once, and I had a few terminated instances sitting around that still matched the Name tag. Fair enough: you probably don't want to accidentally terminate your whole fleet of 100 production instances!

aws cloudformation update-stack --stack-name fis-test-template --template-body file://faultinjection.yml --capabilities CAPABILITY_IAM
{
    "StackId": "arn:aws:cloudformation:ap-southeast-2:000000000000:stack/fis-test-template/f2ce5c00-4c8e-11ec-9c5e-xxxxxxxx"
}

Let's start the FIS experiment again and see if our instances get terminated and auto-heal themselves.

Screen shot of starting a FIS experiment.

Using the watch curl command we can see that even though all of the instances serving our application are getting terminated, we still only had around 2 minutes of downtime while the autoscale group provisioned some more instances. Not bad! But clearly not good enough for a production stack.
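
Eyeballing watch gives a rough feel, but logging each probe makes the outage measurable. Assuming a probe.log with one "HH:MM:SS status-code" line per two-second probe (the file name and format are my own, and a sample log is generated inline so the snippet is self-contained), a little awk estimates the window:

```shell
# Sample probe log: one "HH:MM:SS http_code" line per 2-second probe
printf '%s\n' "12:00:00 200" "12:00:02 000" "12:00:04 000" "12:00:06 200" > probe.log

# Count failed (non-200) probes and multiply by the probe interval
failed=$(awk '$2 != "200" { n++ } END { print n+0 }' probe.log)
echo "approx downtime: $((failed * 2)) seconds"
```

In a real run you'd append one line per curl probe instead of the printf.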

Animation of curl watching the website.

Screen shot of instances shutting down.

Screen shot of no target instances.

Screen shot of target recovered.

Some other things we might look at doing to strengthen our production stack:

  • Launch more instances behind the load balancer
  • Enable termination protection
  • Add some better monitoring / alerting for observability
  • Get better insights into what's happening on our production machines
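
On the termination-protection point: Terraform exposes it as disable_api_termination on a standalone aws_instance (launch configurations don't support it, so with our Auto Scaling group you'd move to a launch template instead). A hypothetical sketch:

```hcl
# Hypothetical standalone instance with termination protection enabled;
# API calls to terminate it will fail until the flag is turned off.
resource "aws_instance" "protected" {
  ami                     = data.aws_ami.amazon_linux.id
  instance_type           = "t3.micro"
  disable_api_termination = true
}
```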

Game Day Applications

I think this is a good service to utilise during game days and training events. Putting your engineers in situations where they have to build - and then break - resilient architecture is a great way to prepare them for actual production outages.

There is a heap more that this service can do, and a heap more we can do to make our services resilient, but this post has ended up much longer than I expected it to! I might have to revisit this and explore other functionality another day.