
Microsoft Application Insights – APM for Everyone

When you work as heavily as I have with a technology like Application Insights, you tend to forget the amazing power you have at your fingertips.

Over the last few years I’ve come to rely heavily on Application Insights as the primary Application Performance Management (APM) tool of choice for services I build, whether they are hosted in Azure or not.

In this post I am going to take a quick walk through features that I think every developer should know about with Application Insights so they can get maximum benefit from it too!

Your language has an SDK

Chances are pretty good that if you’re on a popular platform, Application Insights will have an SDK you can use. SDKs are great because adding them to a solution produces a bunch of default telemetry with nothing more than an Instrumentation Key required.

The Application Insights team maintains their SDK documentation and SDK code references on GitHub. Needless to say .NET has great support, but Java, JavaScript and Node.js also get first-party support, with community support for Go, Python and Ruby. Want to do APM that includes native mobile experiences? No problem, drop in the HockeyApp SDKs.

Use it regardless of your hosting environment

Not using Azure to host your solution? Not a problem. If you can make outbound calls from your host to Application Insights then you can use Application Insights. 💯

Useful free tier

In an upcoming post I’ll talk more about the perceived and actual value of free services in the cloud, but let me say that for most basic scenarios the 5 GB of ingested Application Insights data per month will more than suffice. If not, you can manage your costs by moving to a sampling model, which means you can still glean useful insights about your application’s behaviours without breaking the bank.

No features are removed at the free pricing tier either – you can still do full analytics on the log information that is captured!

Dependency tracking

The out-of-the-box dependency tracking is super handy to diagnose performance issues that result from upstream calls.

The only downside here is that the default capabilities are good at tracking HTTP-based dependencies and SQL Server, and not much else (at the time of writing). Having said this, there is a published way for you to track other custom dependencies if needed, though it requires dedicated code – the out-of-the-box tracking requires no special code at all, which is amazing!
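To give you an idea of what that dedicated code can look like, here is a minimal sketch driving the .NET SDK from PowerShell (illustrative only: the assembly path, Instrumentation Key and dependency details are all placeholders, and the exact TrackDependency overload varies between SDK versions):

# Illustrative sketch: assumes the Microsoft.ApplicationInsights assembly from the
# NuGet package has been copied locally. All paths, keys and names are placeholders.
Add-Type -Path ".\Microsoft.ApplicationInsights.dll"

$client = New-Object Microsoft.ApplicationInsights.TelemetryClient
$client.InstrumentationKey = "00000000-0000-0000-0000-000000000000"

# Time the call to the dependency we want to track (a fictitious cache lookup here)
$start = [DateTimeOffset]::UtcNow
$timer = [System.Diagnostics.Stopwatch]::StartNew()
# ... make the actual call to your dependency here ...
$timer.Stop()

# Record the dependency: type, name, command/data, start time, duration, success
$client.TrackDependency("Cache", "GetUserProfile", "user:42", $start, $timer.Elapsed, $true)
$client.Flush()

The same TrackDependency call is available directly in your application code, which is where it would normally live.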

I have to say that HTTP dependency tracking has been exceptionally useful in a REST-heavy environment, even tracking HTTP calls to external service providers like SendGrid, Twilio and others, giving us an easily accessible view of where our latency is coming from.

The sample below shows dependency behaviour for a single request to a caching service in an application. The very first request (at the bottom of the list) is a call to Cosmos DB that returns a 404 (Not Found) HTTP status code, which then triggers a lookup of some data via an HTTP call to an API, with the result then written to Cosmos DB for the next request. This is super useful information and I did precisely nothing to my code (other than add the Application Insights SDK to my solution) to capture this for every request!

Remote Dependencies

Track impact of releases

Application Insights has a REST API which allows you to add custom steps to Continuous Deployment pipelines to publish a Release Annotation to your timeline in Application Insights so you can see if a release impacts your solution.

Visual Studio Team Services’ Release Management will do this for you automatically, but if you aren’t using VSTS then you can still leverage this capability yourself. A sample is shown below (thankfully we had no negative impact with this release!).

Release Annotation
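If you are not using VSTS and want to roll your own pipeline step, the shape of the call (based on my recollection of Microsoft’s published PowerShell sample) is roughly as follows. Treat the endpoint URL and header name as assumptions to verify against the current documentation; the Application ID and API key come from the API Access blade of your Application Insights resource:

# Rough sketch only. The endpoint URL and X-AIAPIKEY header are assumptions based
# on Microsoft's published sample and may have changed; verify before relying on them.
$applicationId = "<application-insights-app-id>"
$apiKey        = "<api-key-with-annotation-write-permission>"

$annotation = @{
    Id             = [Guid]::NewGuid().ToString()
    AnnotationName = "Release 1.2.3"
    EventTime      = (Get-Date).ToUniversalTime().ToString("s")
    Category       = "Deployment"
} | ConvertTo-Json

Invoke-RestMethod -Method Put `
    -Uri "https://aigs1.aisvc.visualstudio.com/applicationanalytics/v1/apps/$applicationId/Annotations?api-version=2015-11" `
    -Headers @{ "X-AIAPIKEY" = $apiKey } `
    -Body $annotation -ContentType "application/json"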

Insights to your inbox

Application Insights can send summary emails, which is super handy if you don’t want to go hunting for stats or you want to share aggregated stats with stakeholders.

App Insights Email

Heavy duty analytics

If the default experiences in the Azure Portal aren’t enough, then you can leverage the power of Azure Log Analytics to perform more detailed queries and drill into your data and build tables or graphs from the results.

A good example of this is the answer I provided to a question Troy asked on Twitter.

Each request is captured along with useful metadata (in this case from the underlying .NET codebase) which allows us to do further querying and filtering on the data.

Here’s a sample of such a request (this one is an HTTP request to an API endpoint) with the metadata needed to help solve Troy’s question shown.

Sample HTTP Request

The trick is then to head over to the Log Analytics environment…

Open Analytics

.. and then drill into the data to provide you with your desired answer.

Analytics query

You can then tabulate or graph the output. The above is a really simple query – trust me, you can do far more complicated things than this!
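As an aside, you don’t have to use the portal to run these queries; the Application Insights REST API will happily execute the same query from a script. A quick sketch (the App ID and API key come from the API Access blade of your resource; the query itself is just an example):

# Sketch: run an Analytics query via the Application Insights REST API.
$appId  = "<application-insights-app-id>"
$apiKey = "<read-telemetry-api-key>"

$query = "requests | where timestamp > ago(24h) | summarize count() by resultCode"

$result = Invoke-RestMethod `
    -Uri ("https://api.applicationinsights.io/v1/apps/$appId/query?query=" + [Uri]::EscapeDataString($query)) `
    -Headers @{ "x-api-key" = $apiKey }

# Results come back as tables of columns and rows
$result.tables[0].rows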

Failure drill-in

This view has recently improved and become far more interactive – you can easily identify common reasons for failures and drill right in to identify root cause, in my experience within a matter of moments!

In HTTP applications you do get a bit of expected noise (401, 403 and 404 responses and the like) which can be annoying to sift through, particularly for REST-type APIs, but it’s a small price to pay for the power you get!

Failures View

Availability Checks, Health Alerts and Smart Detection

I’m not going to go into these in too much detail, but you can also set Alerts and health checks in Application Insights, and the service will also analyse trends and alert you to items that may require your attention (even if you don’t have a specific alert rule set up).

Custom Events, User Journeys and Cohorts

As with health checks, I am not going to go through these in detail, but if this is the sort of insight you need then it is possible to access it here too. If you need to log custom data in Application Insights you can do that as well using Custom Events.
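To give a flavour, here is a tiny sketch using the same illustrative PowerShell-driving-the-.NET-SDK approach as earlier (names and values are placeholders):

# Illustrative only: same placeholder setup as the dependency tracking sketch above.
Add-Type -Path ".\Microsoft.ApplicationInsights.dll"
$client = New-Object Microsoft.ApplicationInsights.TelemetryClient
$client.InstrumentationKey = "00000000-0000-0000-0000-000000000000"

# Custom events take a name plus optional string properties you can filter on later
$properties = New-Object 'System.Collections.Generic.Dictionary[string,string]'
$properties["PlanType"] = "Trial"
$client.TrackEvent("UserSignedUp", $properties, $null)
$client.Flush()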

What are you waiting for?!

I can honestly say I would be hard pressed these days to build anything without including Application Insights in it, particularly if I won’t have direct access to the hosting environment.

Troubleshooting runtime issues becomes much easier with the details you can glean from walking request stacks as presented by Application Insights. I’ve isolated and fixed more than my fair share of runtime issues (mostly configuration related) without ever needing to try to reproduce them locally, because I could quickly tell via the telemetry where things were going wrong.

Happy days! 😎


Use Azure Health to track active incidents in your Subscriptions

Yesterday afternoon while doing some work I ran into an issue in Azure. Initially I thought this issue was due to a bug in my (new) code, so I went to my usual debugging helper, Application Insights, to review what was going on.

The below graphs are a little old, but you can see a clear spike on the left of each graph, which is where we started seeing issues and which gave me a clue that something was not right!

App Insights views

Initially I thought this was a compute issue as the graphs are for a VM-hosted REST API (left) and a Functions-based backend (right).

At this point there was no service status indicating an issue, so I dug a little deeper, reviewed the detailed Exception information from Application Insights, and realised that the source of the problem was the underlying Service Bus and Event Hub features that we use to glue together our services.

You can see the increased error rate from the Service Bus Metrics view below.

Service Bus Metrics

While I was doing this, an alert popped up in the Portal advising of a service incident and directing me to the Azure Service Health feature in order to view the full incident details and also to track it.

On the Azure Health page I could see an active incident, and I decided to try out the alerting feature to track it during my commute home.

I clicked on the Add Alert option and configured a new email-based alert. You can also push alerts into your preferred IT Service Management (ITSM) solution; we aren’t yet using an ITSM platform for this solution, but that would be our choice in future!

In Services I chose Service Bus and Event Hubs, and for Regions I selected the two Australian Regions. Note that I had to set up an Action Group as I hadn’t used the feature previously – in the screenshot below I am just reusing the one I’d already created.

Alerts Setup
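As an aside, Action Groups can also be created from script rather than through the portal. A sketch using the AzureRM PowerShell module of the day (the cmdlet and parameter names here are from memory, so verify them against your module version):

# Sketch using the AzureRM.Insights module; verify cmdlet names for your version.
# Creates an email receiver and an Action Group the Service Health alert can target.
$receiver = New-AzureRmActionGroupReceiver -Name "email-ops" `
    -EmailReceiver -EmailAddress "ops@example.com"

Set-AzureRmActionGroup -Name "ServiceHealthActions" `
    -ResourceGroupName "alerting-rg" `
    -ShortName "svchealth" `
    -Receiver $receiver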

A short while after saving the Alert configuration the recipients in the Action Group started to receive update emails containing the most recent status of the incident. A sample is shown below.

Notice Email

About 45 minutes after this alert we received a resolution notification.

The amount of time this simple setup saves our team is pretty amazing, and if you’re not using this feature already you should go and explore it in the Portal and set it up for your key Azure components.

What a great early Christmas present!

😎


Secure your VSTS Release Management Azure VM deployments with NSGs and PowerShell

One of the neat features of VSTS’ Release Management capability is the ability to deploy to Virtual Machines hosted in Azure (amongst other environments), which I previously walked through setting up.

One thing that you need to configure when you use this deployment approach is an open TCP port on the Virtual Machines to allow remote access to PowerShell and WinRM on the target machines from VSTS.

In Azure this means we need to define a Network Security Group (NSG) inbound rule to allow the traffic (sample shown below). As we are unable to limit the source address (i.e. where VSTS Release Management will call from) we are stuck creating a rule with a Source of “Any”, which is less than ideal, even with the connection being TLS-secured. This would probably give security teams a few palpitations when they look at it too!

Network Security Group

We might be able to determine a source address based on monitoring traffic, but there is no guarantee that the Release Management host won’t change at some point, which would mean our rule blocks that traffic and our deployment breaks.

So how do we fix this in an automated way with VSTS Release Management and provide a secured environment?

Let’s take a look.

The Fix

The fix, it turns out, is actually quite straightforward.

As the first step you should go to the existing NSG and flip the inbound rule from “Allow” to “Deny”. This will immediately stop the great unwashed masses from being able to hit TCP port 5986 on your Virtual Machines.

As a side note… if you think nobody is looking for your VMs and open ports, try putting a VM up in Azure and leaving RDP (3389) open to “Any” and see how long it takes before you start seeing authentication failures in your Security event log due to account enumeration attempts.

Modify Project Being Deployed

We’re going to leverage an existing Release Management capability to solve this issue, but first we need to provide a custom PowerShell script that we can use to manipulate the NSG that contains the rule we are currently using to block inbound traffic.

This PowerShell script is just a simple wrapper that combines Azure PowerShell Cmdlets to allow us to: a) read the NSG; b) update the rule we need; and c) write the NSG back, which commits the change to Azure.
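A minimal version of such a script might look like the following (shown as a sketch; adjust it to suit your environment):

param(
    [Parameter(Mandatory=$true)][string]$resourceGroupName,
    [Parameter(Mandatory=$true)][string]$networkSecurityGroupName,
    [Parameter(Mandatory=$true)][string]$securityRuleName,
    [Parameter(Mandatory=$true)][ValidateSet("Allow","Deny")][string]$allowOrDeny,
    [Parameter(Mandatory=$true)][int]$priority
)

# a) Read the existing NSG from Azure
$nsg = Get-AzureRmNetworkSecurityGroup -ResourceGroupName $resourceGroupName `
         -Name $networkSecurityGroupName

# b) Find the rule we use for VSTS deployments and flip its Access and Priority
$rule = $nsg.SecurityRules | Where-Object { $_.Name -eq $securityRuleName }
$rule.Access   = $allowOrDeny
$rule.Priority = $priority

# c) Write the NSG back, which commits the change to Azure
Set-AzureRmNetworkSecurityGroup -NetworkSecurityGroup $nsg | Out-Null

Notice that the script only modifies an existing rule rather than creating or deleting one, which, as we’ll see shortly, is also what keeps this approach working when a Resource Group Lock is in place.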

I usually include this script in a Folder called “Deploy” in my project and set the build action to “Copy always”. As a result the file will be copied to the Artefacts folder at build time which means we have access to it in Release Management.

Project Setup

You should run a build with this file included so that it lands in your build Artefacts and is available to your Release Management definition.

Modify Release Management Definition

Note that in order to complete this step you must have a connection between VSTS and your target Azure Subscription already configured as a Service Endpoint. Typically this needs to be done by a user with sufficient rights in both VSTS and the Azure Subscription.

Now we are going to modify our existing Release Management definition to make use of this new script.

The way we are going to enable this is by using the existing Azure PowerShell Task that we have available in both Build and Release Management environments in VSTS.

I’ve shown a sample where I’ve added this Task to an existing Release Management definition.

Release Management Definition

There is a reason this Task is added twice – once to change the NSG rule to be “Allow” and then once, at the end, to switch it back to “Deny”. Ideally we want to do the “Allow” early in the process flow to allow time for the NSG to be updated prior to our RM deployment attempting to access the machine(s) remotely.

The Open NSG Task is configured as shown.

Allow Script

The Script Arguments should match those given in the sample script above. As a sample we might have:

-resourceGroupName MyTestResourceGroup -networkSecurityGroupName vnet01-nsg 
-securityRuleName custom-vsts-deployments -allowOrDeny Allow -priority 3010

The beauty of our script is that the Close NSG Task is effectively the same, but instead of “Allow” we put “Deny”, which switches the rule back to blocking traffic!

Make sure you set the “Close” Task to “Always run”. This way if any other component in the Definition fails we will at least close up the NSG again.

Additionally, if you have a Resource Group Lock in place (and you should for all production workloads) this approach will still work because we are only modifying an existing rule, rather than trying to add / remove it each time.

That’s it!

You can now benefit from VSTS remote deployments while at the same time keeping your environment locked down.

Happy days 🙂


The False Promise of Cloud Auto-scale

Go to the cloud because it has an ability to scale elastically.

You’ve read that bit before right?

If you’ve been involved in anything remotely cloud-related in the last few years you will most certainly have come across the concept of on-demand or elastic scaling. Azure has it, AWS has it, OpenStack has it. It’s one of the cornerstone pieces of any public cloud platform.

Interestingly, it’s also one of the facets of the cloud that I most often see misunderstood by developers and IT Pros alike.

In this post I’ll explore why auto-scale isn’t black magic and why your existing applications can’t always just automagically start using it.

What is “Elastic Scalability”?

Gartner (2009) defines it as follows:

“Scalable and Elastic: The service can scale capacity up or down as the consumer demands at the speed of full automation (which may be seconds for some services and hours for others). Elasticity is a trait of shared pools of resources. Scalability is a feature of the underlying infrastructure and software platforms. Elasticity is associated with not only scale but also an economic model that enables scaling in both directions in an automated fashion. This means that services scale on demand to add or remove resources as needed.”

Source: http://www.gartner.com/newsroom/id/1035013

Sounds pretty neat – based on demand you can utilise platform features to scale out your application automatically. Notice I didn’t say scale up your application. Have a read of this Wikipedia article if you need to understand the difference.

On Microsoft Azure, for example, we have some cool sliders and thresholds we can use to determine how to scale out our deployed solutions.

Azure Auto Scale Screenshot

Scale this service based on CPU or on a schedule.

What’s Not to Understand?

In order to answer this we should examine how we’ve tended to build and operate applications in the on-premise world:

  • More instances of most software means more money for licenses. While you might get some cost relief for warm or cold standby, you are going to have to pony up the cash if you want to run more than a single instance of most off-the-shelf software in warm or hot standby mode.
  • Organisations have shied away from multi-instance applications to avoid needing to patch and maintain additional operating systems and virtualisation hosts (how many “mega” web servers are running in your data centre that host many web sites?).
  • On-premise compute resources are finite (relative to the cloud). Tight control of used resources leads to the outcome in the previous point – consolidation takes place because that hardware your company bought needs to handle the next 3 years of growth.
  • Designing and building an application that can run in a multi-instance configuration can be hard (how many web sites are you running that need “sticky session” support on a load balancer to work properly?). Designing and building applications that are stateless at some level may be viewed by many as black magic!

The upshot of all the above points is that we have tended towards a “less is more” approach when building or operating solutions on-premise. The simplest mode of hosting the application that meets business availability needs is typically the one that gets chosen. Anything more is a luxury (or a complete pain to operate!)

So, How to Realise the Promise?

In order to fully leverage auto-scale capabilities we need to:

  • Adopt off-the-shelf software that provides a consumption-based licensing model. Thankfully in many cases we are already here – we can run many enterprise operating system, application and database software solutions using a pay-as-you-go (PAYG) scheme. I can bring my own license if I’ve already paid for one too. Those vendors who don’t offer this flexibility will eventually be left behind as it will become a competitive advantage for others in their field.
  • Leverage programmable infrastructure via automation and a culture shift to “DevOps” within our organisations. Automation removes the need for manual completion of many operational tasks, thus enabling auto-scale scenarios. The new collaborative structure of DevOps empowers our operational teams to be more agile and to deliver more innovative solutions than they perhaps have done in the past.
  • Be clever about measuring what our minimum and maximum thresholds are for acceptable user experience. Be prepared to set those CPU sliders lower or higher than you might otherwise have if you were doing the same thing on-premise. Does the potential performance benefit of auto-scale at a lower CPU utilisation level outweigh the marginal cost you pay, given that the platform will scale back as necessary?
  • Start building applications for the cloud. If you’ve designed and built applications with many stateless components already then you may have little work to do. If you haven’t, then be prepared to deal with the technical debt to fix them (or start over). Treat as much of your application’s components as you can as cattle and minimise the pets (read a definition that hopefully clears up that analogy).

So there we have a few things we need to think about when trying to realise the potential value of elastic scalability.  The bottom line is your organisation or team will need to invest time before moving to the cloud to truly benefit from auto-scale once there.  You should also be prepared to accept that some of what you build or operate may never be suitable for auto-scale, but that it could easily benefit from manual scale up or out as required (for example at end of month / quarter / year for batch processing).

HTH
