
Recommendations on using Terraform to manage Azure resources

If you’ve been working in the cloud infrastructure space for the last few years you can’t have missed the buzz around HashiCorp’s Terraform product. Terraform provides a declarative model for infrastructure provisioning that spans multiple cloud providers as well as on-premises services from the likes of VMware.

I’ve recently had the opportunity to use Terraform to do some Azure infrastructure provisioning so I thought I’d share some recommendations on using Terraform with Azure (as at January 2018). I’ll also preface this post by saying that I have only been provisioning Azure PaaS services (App Service, Cosmos DB, Traffic Manager, Storage and Application Insights) and haven’t used any IaaS components at all.

In the beginning

I needed to provide an easy way to provision around 30 inter-related services that together constitute the hosting environment for a single customer solution. Ideally I wanted to make it easy to re-provision these services as required.

I’ve used Azure Resource Manager (ARM) templates heavily in the past, but thought I could get some additional value from Terraform as it provides you with additional capabilities that aren’t present in ARM templates. As an example, right now you can’t provision Azure Storage Containers with ARM, but you can with Terraform.
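
To illustrate, here is a minimal sketch of provisioning a Storage Container (and the account it lives in) with the azurerm provider; all names and values below are placeholders, and the syntax reflects the provider as it stood in early 2018:

# Illustrative only: a Storage Account plus a Container inside it
resource "azurerm_resource_group" "example" {
  name     = "customer-solution-rg"
  location = "Australia East"
}

resource "azurerm_storage_account" "example" {
  name                     = "customersolutionstore"
  resource_group_name      = "${azurerm_resource_group.example.name}"
  location                 = "${azurerm_resource_group.example.location}"
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_container" "example" {
  name                  = "content"
  resource_group_name   = "${azurerm_resource_group.example.name}"
  storage_account_name  = "${azurerm_storage_account.example.name}"
  container_access_type = "private"
}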

I began, as I do with these sorts of templates, by incrementally defining resources and building up the Terraform definition as I went. I eventually got to the point where I decided to refactor some of the definitions to modularise the solution and hopefully make it a bit easier to understand and manage going forward.
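
The refactor essentially boiled down to grouping related resources into local modules that the root definition then calls; something along these lines, where the module name, path and inputs are purely illustrative:

# Illustrative only: the root definition calling a local module
module "web_tier" {
  source              = "./modules/web_tier"
  resource_group_name = "${azurerm_resource_group.example.name}"
  environment         = "test"
}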

When I did this refactor I also changed a bunch of resource naming schemes to better match my customer’s preferred standard. The net result of all this change was a substantial number of updates to apply to the test lab I had been incrementally updating as I went.

Now the fun begins

I ran ‘terraform plan’, which generated my execution plan (always make sure to provide the ‘-out’ parameter so you know ‘apply’ will match the plan exactly). I then ran ‘apply’ and left it running while I went to lunch.
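
For reference, that workflow looks something like the following; the plan file name is arbitrary:

# Write the execution plan to a file...
terraform plan -out=tfplan

# ...then apply exactly that saved plan
terraform apply tfplan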

When I returned about 45 minutes later my ‘terraform apply’ was still running, seemingly stuck on destroying one of the resources.

A quick visual check in the Portal of the Azure Resource Group these resources were in suggested that everything I wanted provisioned had been provisioned successfully.

Given this state of affairs I Ctrl+C’d the job, to which Terraform advised me:

Interrupt received.
Please wait for Terraform to exit or data loss may occur.

So, I gave the job a few more minutes to gracefully exit, at which point I sent another Ctrl+C and the job exited with this heart-warming message:

Two interrupts received. Exiting immediately. Note that data loss may have occurred.

Out of interest I immediately ran ‘terraform plan’ to understand what Terraform thought was provisioned versus what actually was.

The net result? Terraform had no idea that anything was provisioned!

A look at the local state file showed it was effectively empty. I restored the backup state file, which turned out to be of minimal use: the delta between the backup and what I had just applied was too great, and the resulting plan looked like an Azure resource massacre about to happen!

What to do?

I thought at this point that I was using the tooling incorrectly – how could I so easily get into this state? If I was using this to manage a production environment I’d be dead in the water.

Through additional reading and speaking with others, I learned that this is a known long-term pain point with Terraform – lose your local state and you are in a world of pain. At this time, you can’t even easily rebuild this local state without writing a series of ‘terraform import’ commands, which means you need to know exactly what to import, and you lose tracking of elements like random string generation at the same time.
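
To give a sense of what that involves, re-importing even a single resource means knowing both its Terraform address and its full Azure resource ID; for example (values below are placeholders):

# Illustrative only: re-import an existing Resource Group into local state
terraform import azurerm_resource_group.example \
  /subscriptions/<subscription-id>/resourceGroups/customer-solution-rg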

Recommendations and Observations

Out of this experience I have some recommendations and observations around how I see Terraform (in its current state) fitting into environmental management in Azure:

  • Always use Resource / Resource Group locks (delete or read-only): this applies even when you aren’t using Terraform, and will stop you accidentally changing or deleting important resources. While you can include resource lock definitions in your Terraform definitions, I’d recommend you leave them out: if you use a Contributor-level user to do your deployments, Terraform will fail when it tries to lock Resource Groups, as Contributor can’t manage locks.
  • Make smaller, more frequent changes: this equates to a smaller delta between what’s in your state, and what’s in the plan. This means if you do need to recover state from backup you will have less of a change to deal with.
  • Consider your use of Terraform features like the ‘random string provider’ – you could move these to be input parameters that you generate outside of Terraform (see the sketch after this list). This means you create a fixed set of inputs, so that even if you lose state you can be assured that creating resources with “random” name components will be consistent with your last successful execution.
  • Use Resource Groups with small sets of Resources: fewer resources to deal with in event of a failure.
  • Consider Terraform as an initial provisioning tool for production and a re-provisioning tool for all dev / test and low complexity environments.
  • Use Terraform to detect drift: if you deploy an environment with Terraform, then set up the same definition as a CI build that simply runs ‘terraform plan’ against the deployed environment, using the state you generated on initial deployment as an input (a minimal sketch follows this list). If the ‘plan’ shows any change (add / delete) you can fail the build and alert your team to investigate accordingly.
  • Consider Terraform for Blue / Green infrastructure deployments in production only: if you want to push completely fresh infrastructure each time then Terraform is a good tool to consider. How workable this approach is depends on the complexity of your environment and the mix of utility / non-utility services you are deploying. It can work well with a slower release cadence (monthly or longer), even if your environment is fairly complex.
  • Use Azure Storage account backing for your state file (key for Terraform Open Source users). You can do this by setting up an Azure Storage Account and then defining the following in each of your TF files:
    terraform {
      backend "azurerm" {
        storage_account_name = "myterraformstore"
        container_name       = "tfstate"
      }
    }
    

    and then when you execute the init step you provide the additional parameters:

    terraform init -backend-config="access_key=*STORAGE_ACCNT_KEY" -backend-config="key=name.ofyour.tfstate"
    

    The shame right now is that you don’t get the state file versioning that those using AWS S3 buckets have access to.

  • Always write an ‘Import’ script once you’ve provisioned key environmental components you can’t afford to lose.
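
To expand on the drift detection point above: ‘terraform plan’ supports a ‘-detailed-exitcode’ switch that returns exit code 2 when the plan contains changes, which is enough to fail a CI build. A minimal sketch, with placeholder backend details and messages:

# Illustrative CI step: fail the build if the deployed environment has drifted
terraform init -backend-config="access_key=$STORAGE_ACCNT_KEY"
terraform plan -detailed-exitcode
rc=$?
if [ $rc -eq 2 ]; then
  echo "Drift detected: environment no longer matches the Terraform definition"
  exit 1
elif [ $rc -eq 1 ]; then
  echo "terraform plan failed to run"
  exit 1
fi

And on the random naming point, the change is essentially to swap a ‘random_string’ resource for a plain input variable whose value you generate (and record) once, outside of Terraform. A rough before / after with illustrative names:

# Instead of generating the suffix inside Terraform...
resource "random_string" "suffix" {
  length  = 6
  special = false
}

# ...pass a pre-generated value in as an input you control
variable "name_suffix" {
  description = "Pre-generated random suffix used in resource names"
}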

As a side note, I notice that the Terraform Azure provider now has a dependency on the Azure Go SDK, which is maintained by Microsoft. I do wonder if this means Terraform loses some of its appeal, because new Azure features for the Terraform provider will invariably be tied to the cadence and capabilities of the Go SDK, which is generated against the official Microsoft Azure API. Will this become the way to block provider features that violate the Azure API definition? I guess time will tell.

As with all tools, Terraform has its strengths and weaknesses – hopefully as the product continues to mature we’ll see key features like re-build / import become part of the core value proposition (and not simply appear in the Enterprise version as a paid value add).


Easy Filtering of IoT Data Streams with Azure Stream Analytics and JSON reference data

I am currently working on a next-gen widget dispenser solution that is gradually being rolled out to trial sites across Australia. The dispenser hardware is a modern platform that provides telemetry data that can be used for various purposes by the locations at which the dispenser is deployed and potentially by other third parties.

In addition to these next-gen dispensers we already have existing dispenser hardware at the locations that emits telemetry we use for other purposes in our solution. To our benefit, both the new and existing hardware emit telemetry data in the same format 🙂

A sample telemetry entry is shown below.
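
(The entry below is illustrative only; the field names are assumptions rather than the actual schema.)

{
  "SiteNumber": 2145,
  "DispenserNumber": 1002,
  "EventTimeUtc": "2018-01-09T11:40:00Z",
  "EventType": "DispenseCompleted"
}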

We take all of the telemetry data from new and old hardware at all our sites and feed it into an Azure Event Hub which allows us to perform multiple actions, such as archival of the data to Blob Storage using Azure Event Hub Capture and processing the streaming data using Azure Stream Analytics.

We decided we wanted to do some additional early stage testing with some of the next-gen hardware at a few sites. As part of this testing we also wanted to push the data for just specific hardware to a partner organisation we are working with. So how did we achieve this?

The first step was to set up another Event Hub. We knew this partner would have no issues consuming event data from a Hub, and Stream Analytics was an obvious way to process the complete incoming stream and ensure that only the data for the dispensers and sites we specify is sent to the partner.

Stream Analytics has the concept of Reference Data which takes the form of slow-moving (or static) data that can be read from a blob storage account in Azure.

We identified our site and dispensers and created our simple Reference Data JSON file – sample below.
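
(The structure below is indicative only; the field names are illustrative.) Each entry holds a site identifier and the array of dispensers whose data we want to pass through:

[
  {
    "SiteNumber": 2145,
    "Dispensers": [1001, 1002, 1003]
  }
]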

The benefit of this format is that we can manage additional sites and dispensers by simply editing this file and uploading it to blob storage! Stream Analytics even helps us by supporting a date / time based naming scheme for the files, so you don’t even need to stop your Stream Analytics Job to update the reference data! We uploaded our first file to a location that had the path of

/siterefdata/2018-01-09/11-40/sitedispensers.json

In future when we want to update the file, we edit it and then upload to blob storage at, say

/siterefdata/2018-02-01/00-00/sitedispensers.json

When the Job hits this date / time (UTC) it will simply pick up the new reference data – how cool is that?!

In order to use the Reference Data auto-update capability you need to set up the path naming scheme when you define the reference data as an input into the Stream Analytics Job. If you don’t need this capability you can simply hard code the path to, say, a single file.
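
For reference, assuming ‘siterefdata’ above is the blob container, the input definition behind this behaviour uses the {date} and {time} substitution tokens with matching date and time formats:

Path pattern: {date}/{time}/sitedispensers.json
Date format:  YYYY-MM-DD
Time format:  HH-mm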

The final piece of the puzzle was to write a Stream Analytics Job that used the Reference Data JSON as one input and read the site identifier and dispensers from the included integer array. Thankfully, the in-built GetArrayElements function, used with CROSS APPLY, gives us the ability to iterate over the array elements and use them in the WHERE clause of the query!
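
The query ended up along these lines; the input, output and field names below are assumptions rather than the real ones, and the actual query may differ in its details:

-- Illustrative only: pass through telemetry for listed sites and dispensers
SELECT
    TEL.SiteNumber,
    TEL.DispenserNumber,
    TEL.EventTimeUtc,
    TEL.EventType
INTO
    [partner-eventhub]
FROM
    [telemetry-input] TEL
JOIN
    [site-dispensers-ref] REF
    ON TEL.SiteNumber = REF.SiteNumber
CROSS APPLY GetArrayElements(REF.Dispensers) AS DISP
WHERE
    TEL.DispenserNumber = DISP.ArrayValue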

The resulting solution now neatly carves off the telemetry data for just the dispensers we want at the sites we list and writes it to an Event Hub the partner organisation can use.

I commented online that this sort of solution, and certainly one that scales as easily as this will, would have been something unachievable for most organisations even just a few years ago.

The cloud, and specifically Azure, has changed all of that!

Happy Days 😎
