AWS Yarns

A blog about AWS things.

Troubleshooting AWS Config Bill Shock

Posted by Chris McKinnel - 14 September 2021
6 minute read

I was granted access to the AWS accounts of one of my newest customers today, and the first thing I usually do is have a look around the account to see what I'm dealing with. It doesn't take long to head over to the billing console and then work backwards from the line items in there.

One of the things that jumped out at me was the amount this customer was paying for AWS Config - it totalled just over 10% of their total bill. Hmm, that didn't look right to me, even though I have seen some sizeable Config bills in my time.

Looking a little closer, the total number of AWS Config changes recorded was about 500,000 in a single month. That's a lot of changes! Definitely something not right.
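
To put a rough dollar figure on that, here's a minimal back-of-the-envelope sketch. The per-item price in it is an assumption on my part (USD 0.003 per recorded configuration item), not a number taken from this customer's bill, so check the current AWS Config pricing page before trusting the output.

```python
from decimal import Decimal

# ASSUMPTION: price in USD per recorded configuration item. Illustrative only;
# confirm against the AWS Config pricing page for your region.
ASSUMED_PRICE_PER_ITEM = Decimal("0.003")


def estimate_config_recording_cost(items_recorded: int,
                                   price_per_item: Decimal = ASSUMED_PRICE_PER_ITEM) -> Decimal:
    """Estimate the monthly charge for recorded configuration items."""
    return items_recorded * price_per_item


# About 500,000 changes in a month, as seen in the billing console.
monthly = estimate_config_recording_cost(500_000)
print(f"Estimated monthly cost: ${monthly:,.2f}")
print(f"Estimated yearly cost:  ${monthly * 12:,.2f}")
```

Decimal is used instead of float so the multiplication doesn't pick up rounding noise.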

Investigation

This customer had about 15 accounts, and at first glance they had AWS services enabled in multiple regions. Clicking through each account and trying to figure out which rules were being triggered wasn't an option, but I remembered that Config delivers its configuration history to S3, and you can use Athena to query structured data in S3.

I requested access to Athena in the Log Archive account (where the Config logs were being aggregated to), and created a table that mapped to the Config structure.

CREATE EXTERNAL TABLE awsconfig (
         fileversion string,
         configSnapshotId string,
         configurationitems ARRAY < STRUCT < configurationItemVersion : STRING,
         configurationItemCaptureTime : STRING,
         configurationStateId : BIGINT,
         awsAccountId : STRING,
         configurationItemStatus : STRING,
         resourceType : STRING,
         resourceId : STRING,
         resourceName : STRING,
         ARN : STRING,
         awsRegion : STRING,
         availabilityZone : STRING,
         configurationStateMd5Hash : STRING,
         resourceCreationTime : STRING > >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://aws-controltower-logs-222222222222-ap-southeast-2/o-xxxxxxx/AWSLogs/1111111111111/Config/ap-southeast-2/';

I then copied a query from an AWS blog that counts how many configuration items were recorded for each resource in the month you specify.


SELECT configurationItem.resourceType,
         configurationItem.resourceId,
         COUNT(configurationItem.resourceId) AS NumberOfChanges
FROM default.awsconfig
CROSS JOIN UNNEST(configurationitems) AS t(configurationItem)
WHERE "$path" LIKE '%ConfigHistory%'
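        -- Half-open range: the '%' in the original bounds was compared as a literal
        -- character, which silently dropped every item captured on the last day.
        AND configurationItem.configurationItemCaptureTime >= '2021-07-01'
        AND configurationItem.configurationItemCaptureTime < '2021-08-01'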
GROUP BY  configurationItem.resourceType, configurationItem.resourceId
ORDER BY  NumberOfChanges DESC

This worked well: I could now see the configuration items for the account and region I specified. The problem was that it covered only that account and only that region. I needed to figure out how to query the data across accounts and across regions at the same time.
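
Hand-editing the dates for each month gets old quickly, so here's a small sketch of how the query text could be generated instead. The function names (month_bounds, build_changes_query) are made up for illustration. It bounds the month with a half-open range (from the first day of the month up to, but not including, the first day of the next), which works because the ISO 8601 capture timestamps sort lexicographically.

```python
from datetime import date
from typing import Tuple


def month_bounds(year: int, month: int) -> Tuple[str, str]:
    """Return (first day of the month, first day of the next month) as ISO dates."""
    start = date(year, month, 1)
    # December rolls over into January of the following year.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.isoformat(), end.isoformat()


def build_changes_query(year: int, month: int, table: str = "default.awsconfig") -> str:
    """Build the per-resource change-count query for one calendar month."""
    start, end = month_bounds(year, month)
    return f"""SELECT configurationItem.resourceType,
         configurationItem.resourceId,
         COUNT(configurationItem.resourceId) AS NumberOfChanges
FROM {table}
CROSS JOIN UNNEST(configurationitems) AS t(configurationItem)
WHERE "$path" LIKE '%ConfigHistory%'
        AND configurationItem.configurationItemCaptureTime >= '{start}'
        AND configurationItem.configurationItemCaptureTime < '{end}'
GROUP BY  configurationItem.resourceType, configurationItem.resourceId
ORDER BY  NumberOfChanges DESC"""


print(build_changes_query(2021, 7))
```

The output is plain SQL text that can be pasted into the Athena console.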

Screen shot of Athena query output.

The bucket structure is split into sub-folders for accounts, service type and region, and I needed a way to have Athena traverse all possible options for accounts and regions.

Screen shot of AWS Config S3 bucket structure.
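
If you have a listing of object keys from the bucket, one way to work out which account and region combinations actually hold Config data is to pull them out of the key prefix. This is only a sketch: discover_partitions is a hypothetical helper, and the sample keys are invented (everything after the region folder is made up for illustration).

```python
import re
from typing import Iterable, List, Tuple

# Matches the .../AWSLogs/<account>/Config/<region>/ part of an object key.
PARTITION_PATTERN = re.compile(r"AWSLogs/(?P<account>\d+)/Config/(?P<region>[a-z0-9-]+)/")


def discover_partitions(keys: Iterable[str]) -> List[Tuple[str, str]]:
    """Return the sorted, de-duplicated (account, region) pairs found in the keys."""
    found = set()
    for key in keys:
        match = PARTITION_PATTERN.search(key)
        if match:  # keys for other services (e.g. CloudTrail) are skipped
            found.add((match.group("account"), match.group("region")))
    return sorted(found)


# Invented sample keys, not real data.
sample_keys = [
    "o-xxxxxxx/AWSLogs/111111111111/Config/ap-southeast-2/2021/7/1/ConfigHistory/example.json.gz",
    "o-xxxxxxx/AWSLogs/111111111111/Config/us-east-1/2021/7/1/ConfigHistory/example.json.gz",
    "o-xxxxxxx/AWSLogs/333333333333/Config/ap-southeast-2/2021/7/2/ConfigSnapshot/example.json.gz",
    "o-xxxxxxx/AWSLogs/333333333333/CloudTrail/ap-southeast-2/2021/07/02/example.json.gz",
]

partitions = discover_partitions(sample_keys)
for account, region in partitions:
    print(account, region)
```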

Enter Athena Partitions

Shout out to Matt Johnston who showed me how to do this! Matt is one of our Senior Cloud DevOps Engineers at CCL, and is an absolute wealth of knowledge when it comes to all things AWS.

By declaring partition columns when we create the table, we can then register a partition for each account and region and tell Athena which S3 location holds that partition's data. This lets a single table span multiple folders in the S3 bucket, with each folder mapped to a partition value.

CREATE EXTERNAL TABLE awsconfig (
         fileversion string,
         configSnapshotId string,
         configurationitems ARRAY < STRUCT < configurationItemVersion : STRING,
         configurationItemCaptureTime : STRING,
         configurationStateId : BIGINT,
         awsAccountId : STRING,
         configurationItemStatus : STRING,
         resourceType : STRING,
         resourceId : STRING,
         resourceName : STRING,
         ARN : STRING,
         awsRegion : STRING,
         availabilityZone : STRING,
         configurationStateMd5Hash : STRING,
         resourceCreationTime : STRING > >
)
PARTITIONED BY (account string, region string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://aws-controltower-logs-222222222222-ap-southeast-2/o-xxxxxxx/AWSLogs/1111111111111/Config/ap-southeast-2/';

And then adding the location for the partitions:

ALTER TABLE awsconfig ADD PARTITION (account='1111111111111', region='ap-southeast-2') LOCATION 's3://aws-controltower-logs-222222222222-ap-southeast-2/o-xxxxxxx/AWSLogs/1111111111111/Config/ap-southeast-2/';

Note that we give Athena the account number and region as the partition values, and then we show it where to look for that partition's data. I also decided that the likelihood of this issue being in Sydney was high, so I didn't bother adding partitions for the other regions.

I'm slightly embarrassed to admit I didn't write a script to do this for all 15 accounts - I changed the account numbers one by one and ran each alter table query manually... For shame. If I had to add the other regions, I would have written a Python script to do this for me, for sure.
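
For what it's worth, that script could look something like the sketch below. It only generates the statements as text; it doesn't call Athena. The account IDs and the second region are placeholders I've made up, and add_partition_statement is a hypothetical helper name. The bucket name and organisation ID follow the anonymised values used in this post.

```python
from itertools import product
from typing import List

# Anonymised values as used elsewhere in this post.
BUCKET = "aws-controltower-logs-222222222222-ap-southeast-2"
ORG_ID = "o-xxxxxxx"

# PLACEHOLDERS: swap in your real account IDs and the regions you care about.
ACCOUNTS = ["111111111111", "333333333333"]
REGIONS = ["ap-southeast-2", "us-east-1"]


def add_partition_statement(account: str, region: str, table: str = "awsconfig") -> str:
    """Build one ALTER TABLE statement that registers an account/region partition."""
    location = f"s3://{BUCKET}/{ORG_ID}/AWSLogs/{account}/Config/{region}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
        f"(account='{account}', region='{region}') LOCATION '{location}';"
    )


# One statement per account/region combination.
statements: List[str] = [
    add_partition_statement(account, region)
    for account, region in product(ACCOUNTS, REGIONS)
]

for statement in statements:
    print(statement)
```

IF NOT EXISTS makes the statements safe to re-run if some partitions have already been added by hand.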

Querying all accounts

With the new partitions set up, I could run the exact same query and Athena did all the hard work for me!


SELECT configurationItem.resourceType,
         configurationItem.resourceId,
         COUNT(configurationItem.resourceId) AS NumberOfChanges
FROM default.awsconfig
CROSS JOIN UNNEST(configurationitems) AS t(configurationItem)
WHERE "$path" LIKE '%ConfigHistory%'
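        -- Half-open range: the '%' in the original bounds was compared as a literal
        -- character, which silently dropped every item captured on the last day.
        AND configurationItem.configurationItemCaptureTime >= '2021-07-01'
        AND configurationItem.configurationItemCaptureTime < '2021-08-01'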
GROUP BY  configurationItem.resourceType, configurationItem.resourceId
ORDER BY  NumberOfChanges DESC

Now instead of just seeing the data for a single account, we can see it for all accounts.

Screen shot of Athena query output.

But which accounts exactly? I needed to add the account number to the query so I could see where exactly I needed to look to troubleshoot further.


SELECT configurationItem.resourceType,
         configurationItem.resourceId,
         COUNT(configurationItem.resourceId) AS NumberOfChanges,
         account,
         region
FROM default.awsconfig
CROSS JOIN UNNEST(configurationitems) AS t(configurationItem)
WHERE "$path" LIKE '%ConfigHistory%'
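        -- Half-open range: the '%' in the original bounds was compared as a literal
        -- character, which silently dropped every item captured on the last day.
        AND configurationItem.configurationItemCaptureTime >= '2021-08-01'
        AND configurationItem.configurationItemCaptureTime < '2021-09-01'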
GROUP BY  configurationItem.resourceType, configurationItem.resourceId, account, region
ORDER BY  NumberOfChanges DESC

Screen shot of Athena query output.
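
Once the account and region columns are in the output, it can help to roll the per-resource counts up so the noisiest account jumps out. Here's a rough sketch, assuming the result rows have been loaded as tuples in the same column order as the SELECT above. The rows below are invented sample values, not this customer's data, and changes_per_partition is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (resourceType, resourceId, NumberOfChanges, account, region)
Row = Tuple[str, str, int, str, str]


def changes_per_partition(rows: Iterable[Row]) -> List[Tuple[Tuple[str, str], int]]:
    """Sum NumberOfChanges per (account, region), largest total first."""
    totals: Counter = Counter()
    for _resource_type, _resource_id, changes, account, region in rows:
        totals[(account, region)] += int(changes)
    return totals.most_common()


# Invented sample rows for illustration only.
sample_rows: List[Row] = [
    ("AWS::Config::ResourceCompliance", "resource-a", 4200, "111111111111", "ap-southeast-2"),
    ("AWS::Config::ResourceCompliance", "resource-b", 3900, "111111111111", "ap-southeast-2"),
    ("AWS::EC2::NetworkInterface", "resource-c", 120, "333333333333", "ap-southeast-2"),
]

summary = changes_per_partition(sample_rows)
for (account, region), total in summary:
    print(account, region, total)
```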

It turns out that a Config rule created by Security Hub, which checks that instances managed by SSM stay associated with SSM at all times, was flipping between compliant and non-compliant thousands of times a month for each managed instance. (Why, you might ask? No idea yet; that's for one of my super smart AWS nerds to figure out.)

By using Athena and leveraging partitions, I was able to query the data in minutes to uncover a 5-figure saving for this customer. Pretty neat!

Summary

It's easy to assume that the services AWS recommend everybody enable for maximum protection are just doing what they should, and you can set and forget them and sleep peacefully knowing you're protected from the bad guys.

The reality is you should be checking your AWS bill every single time you enable a new service, and making sure you can reconcile the amount that shows up on your bill with what you expected to see there.