IaC Philosophy
TL;DR: Terraform native isn’t expressive enough for complex enterprise cloud footprints and the processes around them. Use the CDK for Terraform to express your infrastructure in a general-purpose programming language, and use the expressiveness of that language to handle the oddities that come with managing large clouds. See how here: Example Enterprise Cloud
The first thing I want to put out there is my take on Infrastructure as Code. My view is that infrastructure should be managed like software. We plan changes like software, develop and iterate like software, and deploy and release like software, and our tools, including the language we use to express our infrastructure, should reflect that. In my opinion, managing infrastructure has evolved past the point where it is easily expressed by things like Terraform’s DSL or CloudFormation YAML. I think much of the DevOps industry would agree, which is why we’re starting to see the adoption of things like the AWS CDK, the CDK for Terraform (CDKTF), and independent tools like Pulumi. Large and complex enterprise cloud infrastructures just fit better into the expressiveness of a programming language.
Why though? What’s wrong with Terraform native or YAML?
Well, nothing necessarily. Especially not when viewed through the lens of personal projects. However, when you expand your cloud footprint to hundreds of developers, dozens of AWS accounts, compliance enforcement, release processes, auditability, and so on, the things we would like to express and enforce become difficult in Terraform native or YAML. We end up with large repositories, isolated from the rest of the engineering ecosystem, dedicated to representing the current cloud footprint. They often become havens for copied-and-pasted chunks of resources or modules, large variable or configuration files dedicated to specific deployment patterns, and unique one-off architectures that don’t fit into anything else.
Managing infrastructure with the upfront mindset that it is software naturally pushes us to express the architecture the same way we would write good code. A programming language lets us apply principles like single responsibility and dependency inversion, which keep our codebase concise and testable. It keeps our code from sprawling into a whirlwind of modules and variable files because we’re mostly working with abstractions.
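To make that a little more concrete, here is a minimal sketch in plain Python. It does not use the CDKTF API, and every name in it (EncryptionPolicy, KmsEncryption, S3ManagedEncryption, encrypted_bucket) is hypothetical. The returned dictionary is only loosely modeled on Terraform resource blocks and is not validated against any provider schema. The point is the shape of the code: the bucket builder has one job, and it depends on an abstraction rather than on a concrete encryption choice.

```python
from dataclasses import dataclass
from typing import Protocol


class EncryptionPolicy(Protocol):
    """Abstraction the bucket builder depends on (dependency inversion)."""

    def sse_algorithm(self) -> str: ...


@dataclass(frozen=True)
class KmsEncryption:
    """Hypothetical policy for accounts that require KMS-backed encryption."""

    def sse_algorithm(self) -> str:
        return "aws:kms"


@dataclass(frozen=True)
class S3ManagedEncryption:
    """Hypothetical policy for accounts where S3-managed keys are acceptable."""

    def sse_algorithm(self) -> str:
        return "AES256"


def encrypted_bucket(name: str, policy: EncryptionPolicy) -> dict:
    """Single responsibility: describe one bucket and its encryption rule.

    The dictionary layout is an illustrative assumption, not a validated
    Terraform document.
    """
    return {
        "aws_s3_bucket": {name: {"bucket": name}},
        "aws_s3_bucket_server_side_encryption_configuration": {
            name: {
                "bucket": name,
                "rule": [
                    {
                        "apply_server_side_encryption_by_default": {
                            "sse_algorithm": policy.sse_algorithm()
                        }
                    }
                ],
            }
        },
    }
```

Because the builder only knows about the EncryptionPolicy abstraction, a unit test can hand it a fake policy and assert on the result without touching a cloud account, which is the kind of testability I’m arguing for.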
Ok fine, maybe that’s true, but infrastructure doesn’t produce the same product that a software release does, so why should I expect to manage it like software? For example, sometimes when I need to do a deployment, even if I’m using Terraform, I’m still responsible for getting the deployment right myself: removing resources from state, sequencing destructive operations safely, and so on.
I agree with this in part. Sometimes infrastructure changes are tricky, and sometimes we don’t have the luxury of just producing a net new container to run or server to spin up. Infrastructure deployments change something that already exists, which is a different deployment pattern than most software releases. This is true. But you know who has been dealing with this for decades, since long before the cloud?
DBAs.
Anybody who has maintained a large, complex, changing database schema has dealt with database migrations. The idea is extremely similar to our infrastructure changes: we need to take the database from State A to State B, and along the way we need to do a little handholding to make sure the application doesn’t burn down in the middle. Sometimes our infrastructure deployments are the same, and at the time of writing, Terraform native is not expressive enough to handle this. When we deploy with Terraform to go from State A to State B, we’re at the mercy of how that change has been implemented in Terraform and, in part, in the AWS API. I would argue that writing your infrastructure in a programming language immediately gives us the flexibility to begin writing infrastructure migrations in the same way we would write database migrations: a custom deployment process that takes our infrastructure from State A to State B and back again.
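As a rough sketch of what an infrastructure migration could look like, here is a small Python model borrowed directly from the database world. Everything in it is an assumption for illustration: the Migration type, the rename helper, the resource addresses, and the in-memory dictionary standing in for Terraform state. A real step would more likely wrap commands such as `terraform state mv` or `terraform state rm` than edit a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Stand-in for Terraform state: resource address -> attributes (hypothetical).
State = Dict[str, dict]


@dataclass(frozen=True)
class Migration:
    """One reversible step, like a database migration with up and down."""

    version: int
    description: str
    up: Callable[[State], State]
    down: Callable[[State], State]


def rename(old: str, new: str) -> Callable[[State], State]:
    """Build a step that moves a resource to a new address without mutating input."""

    def step(state: State) -> State:
        moved = dict(state)
        moved[new] = moved.pop(old)
        return moved

    return step


# Hypothetical migration history for this sketch.
MIGRATIONS: List[Migration] = [
    Migration(
        version=1,
        description="rename the jobs queue to orders",
        up=rename("aws_sqs_queue.jobs", "aws_sqs_queue.orders"),
        down=rename("aws_sqs_queue.orders", "aws_sqs_queue.jobs"),
    ),
]


def migrate(state: State, current: int, target: int) -> State:
    """Apply 'up' steps in ascending order, or 'down' steps in descending order."""
    if target >= current:
        for m in sorted(MIGRATIONS, key=lambda m: m.version):
            if current < m.version <= target:
                state = m.up(state)
    else:
        for m in sorted(MIGRATIONS, key=lambda m: m.version, reverse=True):
            if target < m.version <= current:
                state = m.down(state)
    return state
```

Calling migrate(state, 0, 1) takes the sketch from State A to State B, and migrate(state, 1, 0) takes it back, which is the "and back again" part that is hard to express declaratively.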
Okay, but wait, are you saying we shouldn’t use Terraform?
No, not at all. Terraform is a phenomenal state manager; that’s its key selling point. We absolutely should be using Terraform to manage the state of our infrastructure. However, the DSL of Terraform native is just not expressive enough at the moment. Something like the CDK for Terraform is a perfect merging of worlds. It allows us to write our infrastructure in a programming language, and the deployment process synthesizes that code into Terraform JSON configuration that Terraform can work with, so we still get to take advantage of Terraform’s state management.
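To illustrate the idea of "code in, Terraform configuration out", here is a toy synthesizer in plain Python. It is not the CDKTF API and says nothing about how CDKTF is implemented; the synth function, the region list, and the bucket names are all hypothetical. It only shows that an ordinary loop can produce a document in the general shape of Terraform’s JSON configuration syntax (top-level "provider" and "resource" keys), which Terraform could then plan and apply.

```python
import json
from typing import Dict, List

# Hypothetical regions this footprint deploys to.
REGIONS: List[str] = ["us-east-1", "eu-west-1"]


def synth(regions: List[str]) -> Dict[str, dict]:
    """Build one aliased AWS provider and one audit bucket per region."""
    providers = []
    buckets = {}
    for region in regions:
        alias = region.replace("-", "_")
        providers.append({"alias": alias, "region": region})
        buckets[f"audit_logs_{alias}"] = {
            # Points the resource at the aliased provider for its region.
            "provider": f"aws.{alias}",
            "bucket": f"example-audit-logs-{region}",  # illustrative name only
        }
    return {
        "provider": {"aws": providers},
        "resource": {"aws_s3_bucket": buckets},
    }


# Serialize the result; in practice this would be written to a *.tf.json file.
document = json.dumps(synth(REGIONS), indent=2, sort_keys=True)
```

Adding a region is a one-line change to the list instead of another copied block, and Terraform still owns planning, applying, and state.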
This is a lot of opinion without a lot of evidence, which is why I put together this repo: Example Enterprise Cloud. In this project, I’m going to walk through how a large enterprise might use the CDK for Terraform to define and manage its infrastructure. I’ll cover things like managing multiple AWS accounts across multiple regions, creating the kind of deployment process you might find in such an enterprise, enforcing various compliance policies, and building an extensible library for our infrastructure. At each checkpoint, I’ll write up a post about what we’ve done, link the commit, and add it to the repo’s README so you can get a snapshot of how the repo has evolved over time.