Maintaining a cloud environment at scale is difficult even in the best of times. How can you monitor service level indicators while considering configuration settings across hundreds or thousands of services? Infrastructure as code (IaC) services like Terraform, CloudFormation, and Serverless Framework have allowed us to create declarative models of our desired state and, in an ideal world, maintain that state.
Imagine a scenario where you’ve embraced IaC. You’re loving the chaos and have actually heard a developer say “DevSecOps” once or twice. You’re even utilizing automated security measures like Checkov to scan your IaC for potential misconfigurations that may result in security or compliance issues down the line.
You did it! No need for any further monitoring—your cloud is 100% awesome!
Wouldn’t that be nice? Unfortunately, new security issues emerge daily and humans make mistakes. Sometimes, changes made under time pressure can create drift between our desired IaC state and our runtime state.
When things fall apart (in runtime)
When something does go amiss in runtime, our goal as SRE, security, and DevOps teams is to fix the problem in as little time as possible. The challenge we have now is that (thanks to the double-edged sword of IaC) everything is provisioned by our friendly Terraform service account and not the actual user who created the specification for the resource in question. So begins our detective work, starting with the creation of a Jira ticket assigned to somebody you hope will know the resource owner’s brother’s nephew’s cousin’s former roommate.
This will result in a lengthy mean time to resolve (MTTR).
“Oh no it won’t,” you say. While they burn cycles sorting out ownership, you can log into your AWS console and change the configuration manually to bring it all back into compliance. The catch is that this adds complexity to the issue (see drift above). Imperative changes to your cloud environment create drift, which means your IaC no longer describes what is actually running.
Automated tagging to the rescue
Enter Yor, our open source tool for automatically tagging IaC resources!
To learn more about resource tagging in general, check out the tagging best practices guide from your cloud provider (or keep reading for the TL;DR).
You might be amazed at the use cases a consistent resource tagging strategy can enable, from cost analysis to access control policies. In our case, such a strategy solves the issue of tracing a misconfigured and potentially insecure resource back to the original commit and owner, and takes us directly to the fix location in git.
Yor can be executed as simply as:
yor tag -d
Let’s take a quick look at an example Terraform template (s3.tf) before and after Yor tags the resources in Terraform.
While our original tags “Name” and “Environment” are still safe and sound, we have eight new tags by default. Seven are leveraging details we can already get from our git repo and one mysterious tag called yor_trace.
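As a rough illustration of how you might verify that a resource carries the full tag set, here is a minimal Python sketch. Only git_commit and yor_trace are named in this post; the other six tag names are an assumption based on Yor's documented defaults, so check them against the version you run. The helper name missing_yor_tags is hypothetical.

```python
# Tag names assumed from Yor's documented defaults (only git_commit and
# yor_trace appear in this post); verify against your Yor version.
EXPECTED_YOR_TAGS = {
    "git_commit",
    "git_file",
    "git_last_modified_at",
    "git_last_modified_by",
    "git_modifiers",
    "git_org",
    "git_repo",
    "yor_trace",
}


def missing_yor_tags(tags: dict) -> set:
    """Return the expected Yor tag names that are absent from a resource's tags."""
    return EXPECTED_YOR_TAGS - set(tags)


if __name__ == "__main__":
    # A resource that only has its original, hand-written tags.
    untagged = {"Name": "my-bucket", "Environment": "dev"}
    print(sorted(missing_yor_tags(untagged)))
```

A check like this could run against the tags reported for a deployed resource to flag anything that bypassed the tagging step.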
Leveraging tags for faster MTTR
The yor_trace tag is a unique identifier indicating the git commit in combination with the specific IaC resource.
While the detail from git is extremely helpful (it connects an identity to our runtime resources), yor_trace goes a step further, leading us directly to our ideal fix location.
Take the above example and assume we check in the above untagged s3.tf code. The CI/CD flow triggers our new Yor GitHub Action, which automatically adds or updates the tags for the resource. As a final step, we execute terraform apply on our newly tagged resources, including the above S3 bucket.
Fast forward a few moments and my SRE gets a notification that noncompliant resources have been deployed into our production environment and heads to Bridgecrew to check it out.
Clicking on the 38 noncompliant resources reveals a bonanza of misconfigurations, but we focus on the data first and look for any of those misconfigured S3 buckets that we keep hearing about in the news.
From here, it’s easy to click on the Resource Identifier, scroll down past the Details and View Configuration to reveal the associated “Tags” and presto!
While all of these tags are going to help me, the one I want immediately is yor_trace. While I already know I can use the git_commit tag to locate the commit, yor_trace takes traceability to the next level, as it is a unique tag that will take me to the specific resource within the commit. From there I can see who, what, why, when, and how it happened and open a pull request that will fix it.
Additionally, I could use the Bridgecrew platform to find the same and similar misconfigurations along with their suggested fixes.
From here, I can go to my git repo, paste the yor_trace tag into the search, and choose “in this repository” as the scope. Because yor_trace is one of a kind, I could even search my entire organization or the entirety of GitHub.
As expected, I’m presented with one result—also known as a GitHubwhack!
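The same lookup can be done against a local clone instead of the GitHub search box. The following is a minimal Python sketch, not part of Yor or Bridgecrew: the function name find_trace, the *.tf file pattern, and the all-zero trace value are assumptions made for illustration.

```python
from pathlib import Path


def find_trace(repo_root, trace_id, pattern="*.tf"):
    """Return (path, line_number) pairs for every line containing trace_id.

    Walks repo_root recursively and only reads files matching pattern.
    """
    hits = []
    for path in sorted(Path(repo_root).rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if trace_id in line:
                hits.append((path, lineno))
    return hits


if __name__ == "__main__":
    # Placeholder value; paste the yor_trace tag copied from the resource.
    for path, lineno in find_trace(".", "00000000-0000-0000-0000-000000000000"):
        print(f"{path}:{lineno}")
```

Because the trace value is unique, a single hit is the expected result, and the reported file and line point at the resource block that needs the fix.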
My cloud to code search is complete in a matter of minutes and I can now create a Jira ticket for the correct person (as indicated by the git committer). I’ll send them a Slack message to let them know where their attention is required to get this fixed ASAP.
MTTR stats smashed before breakfast! My next step is to set up an automated workflow with Slack to help automate our cloud to code reporting. We’ll cover exactly how to do that in a future blog.
Final thoughts
IaC has moved us further toward a world where environments, like deployments in Kubernetes, can be treated like cattle and not pets. It lets us destroy resources and reliably recreate a known state. A common error is to adopt IaC but never exercise that ability until a major drift or incident forces a reset.
Total destruction and recreation should technically work, but why wait for high-pressure circumstances such as a major drift detection or incident response to try it? Definitions of stability and security have changed. Unlike the days of data centers where a server that had been running successfully for years was considered stable, now such a server would be considered a liability.
If you are implementing IaC or thinking about doing so in the future, take a moment to learn about Bridgecrew’s open source tools: Yor, Checkov, and AirIAM.