Recommendations on using Terraform to manage Azure resources

If you've been working in the cloud infrastructure space for the last few years, you can't have missed the buzz around HashiCorp's Terraform product. Terraform provides a declarative model for infrastructure provisioning that spans multiple cloud providers as well as on-premises services from the likes of VMware.

I've recently had the opportunity to use Terraform to do some Azure infrastructure provisioning so I thought I'd share some recommendations on using Terraform with Azure (as at January 2018). I'll also preface this post by saying that I have only been provisioning Azure PaaS services (App Service, Cosmos DB, Traffic Manager, Storage and Application Insights) and haven't used any IaaS components at all.

In the beginning

I needed to provide an easy way to provision around 30 inter-related services that together constitute the hosting environment for a single customer solution. Ideally, I wanted an easy way to re-provision these services as required.

I've used Azure Resource Manager (ARM) templates heavily in the past, but thought I could get some additional value from Terraform as it provides capabilities that aren't present in ARM templates. As an example, right now you can't provision Azure Storage Containers with ARM, but you can with Terraform.
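
To illustrate, here's a minimal sketch of a Storage Container defined with the azurerm provider (the container, account and resource group names are placeholders):

    # A blob container inside an existing Storage Account
    resource "azurerm_storage_container" "content" {
      name                  = "content"
      resource_group_name   = "my-resource-group"
      storage_account_name  = "mystorageaccount"
      container_access_type = "private"
    }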

I began, as I do with these sorts of templates, by incrementally defining resources and building out the Terraform definition as I went. Eventually I decided to refactor some of the Terraform definitions into modules, hoping to make the solution easier to understand and manage going forward.

When I did this refactor I also changed a bunch of resource naming schemes to better match my customer's preferred standard. The net result of all this change was a substantial set of updates to apply to the test lab environment I had been incrementally updating as I went.

Now the fun begins

I ran 'terraform plan', which generated my execution plan (always make sure to provide an '-out' parameter so you know 'apply' will match the plan exactly). I then ran 'apply' and left it running while I went to lunch.
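
For reference, a minimal sketch of that workflow:

    # Write the execution plan to a file...
    terraform plan -out=tfplan

    # ...then apply exactly that plan. If the state has changed since the
    # plan was written, Terraform refuses to apply the stale plan.
    terraform apply tfplan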

When I returned about 45 minutes later my 'terraform apply' was still running, seemingly stuck on destroying one of the resources.

A quick visual check of the Azure Resource Group in the Portal suggested that everything I wanted provisioned had been provisioned successfully.

Given this state of affairs I Ctrl+C'd the job, to which Terraform advised me:

Interrupt received.
Please wait for Terraform to exit or data loss may occur.

So, I gave the job a few more minutes to gracefully exit, at which point I sent another Ctrl+C and the job exited with this heart-warming message:

Two interrupts received. Exiting immediately. Note that data loss may have occurred.

Out of interest I immediately ran 'terraform plan' to understand what Terraform thought was provisioned versus what actually was.

The net result? Terraform had no idea that anything was provisioned!

A look at the local state file showed it was effectively empty. I restored the backup state file, which turned out to be of minimal use: the delta between the backup and what I had just applied was too great, and the resulting plan looked like an Azure resource massacre about to happen!

What to do?

I thought at this point that I was using the tooling incorrectly – how could I so easily get into this state? If I was using this to manage a production environment I'd be dead in the water.

Through additional reading and speaking with others, I learned that this is a known, long-standing pain point with Terraform: lose your local state and you are in a world of pain. At this time you can't easily rebuild the local state without writing a bunch of 'terraform import' commands, which means you need to know exactly what to import, and you lose tracking of elements like random string generation at the same time.

Recommendations and Observations

Out of this experience I have some recommendations and observations around how I see Terraform (in its current state) fitting into environmental management in Azure:

  • Always use Resource / Resource Group locks (delete or read-only): this applies even outside of Terraform, as it stops you from accidentally changing or deleting important resources. While you can include resource lock definitions in your Terraform configuration, I'd recommend you leave them out: if you use a Contributor-level user to do your deployments, Terraform will fail when it tries to lock Resource Groups. You can apply the locks out-of-band instead, as in the sketch below.
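
    A sketch of applying a delete lock with the Azure CLI (the lock and resource group names here are placeholders):

    # Prevent accidental deletion of everything in the group
    az lock create --name no-accidental-deletes \
                   --resource-group my-resource-group \
                   --lock-type CanNotDelete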

  • Make smaller, more frequent changes: this equates to a smaller delta between what's in your state and what's in the plan. If you do need to recover state from backup, you will have less change to deal with.

  • Consider your use of Terraform features like the 'random' provider's random_string resource - you could move these to input parameters that you generate outside of Terraform. This gives you a fixed set of inputs, so even if you lose state you can be assured that resources with "random" name components will be named consistently with your last successful execution (see the sketch below).
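
    A minimal sketch of the swap, with illustrative names: declare the suffix as a plain input variable and generate its value outside Terraform.

    # The suffix is now a fixed input you generate (and record) externally,
    # e.g. terraform apply -var="name_suffix=x7k2pq"
    variable "name_suffix" {
      description = "Pre-generated random suffix used in resource names"
    }

    resource "azurerm_storage_account" "example" {
      name                     = "mystore${var.name_suffix}"
      resource_group_name      = "my-resource-group"
      location                 = "australiaeast"
      account_tier             = "Standard"
      account_replication_type = "LRS"
    }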

  • Use Resource Groups with small sets of Resources: fewer resources to deal with in the event of a failure.

  • Consider Terraform as an initial provisioning tool for production and a re-provisioning tool for all dev / test and low-complexity environments.

  • Use Terraform to detect drift: if you deploy an environment with Terraform, then set up the same definition as a CI build that simply runs 'terraform plan' against the deployed environment, using the state you generated on initial deployment as an input. If the 'plan' results in any change (add / delete), you can fail the build and alert your team to investigate accordingly, as in the sketch below.
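
    A sketch of such a build step, using plan's '-detailed-exitcode' flag (exit code 2 indicates pending changes, 0 means none):

    terraform plan -detailed-exitcode
    case $? in
      0) echo "No drift detected" ;;
      2) echo "Drift detected - failing the build for investigation" ; exit 1 ;;
      *) echo "terraform plan itself failed" ; exit 1 ;;
    esac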

  • Consider Blue / Green infrastructure deployments for production only: if you want to push completely fresh infrastructure each time, Terraform is a good tool to consider. The viability of this approach is determined by the complexity of your environment and the mix of utility / non-utility services you are deploying. This can work well with a slower release cadence (monthly or above), even if your environment is fairly complex.

  • Use Azure Storage account backing for your state file (key for Terraform Open Source users). You can do this by setting up an Azure Storage Account and then defining the following in your Terraform configuration:

    terraform {
      backend "azurerm" {
        storage_account_name = "myterraformstore"
        container_name       = "tfstate"
      }
    }
    

    and then when you execute the init step you provide the additional parameters:

    terraform init -backend-config="access_key=<STORAGE_ACCOUNT_KEY>" \
                   -backend-config="key=name.ofyour.tfstate"
    

    The shame here right now is that you don't get the state file versioning that users of AWS S3 buckets have access to.

  • Always write an import script once you've provisioned key environmental components you can't afford to lose, as in the sketch below.
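
    A sketch of what one line of such a script might look like (the resource address and Azure resource ID are purely illustrative):

    # Re-attach an existing App Service Plan to a fresh state file
    terraform import azurerm_app_service_plan.main \
        /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/my-resource-group/providers/Microsoft.Web/serverfarms/my-plan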

As a side note, I notice that the Terraform Azure provider now depends on the Azure Go SDK, which is maintained by Microsoft. I do wonder if this means Terraform loses some of its appeal, because new Azure features for Terraform will invariably be tied to the cadence and capabilities of the Go SDK, which is generated from the official Microsoft Azure API. Will this become the way to block provider features that violate the Azure API definition? I guess time will tell.

As with all tools, Terraform has its strengths and weaknesses - hopefully, as the product continues to mature, we'll see key features like rebuild / import become part of the core value proposition (and not simply appear in the Enterprise version as a paid value-add).