Use Azure Health to track active incidents in your Subscriptions
- Author: Simon Waight (Mastodon: @simonwaight)
Yesterday afternoon, while doing some work, I ran into an issue in Azure. Initially I thought it was due to a bug in my (new) code, so I went to my usual debugging helper, Application Insights, to review what was going on.
The graphs below are a little old, but you can see a clear spike on the left of each, which is where we started seeing issues and which gave me a clue that something was not right!
Initially I thought this was a compute issue as the graphs are for a VM-hosted REST API (left) and a Functions-based backend (right).
At this point there was no service status indicating an issue, so I dug a little deeper and reviewed the detailed Exception information in Application Insights. That showed the source of the problem was the underlying Service Bus and Event Hubs features that we use to glue our services together.
You can see the increased error rate from the Service Bus Metrics view below.
While I was doing this, an alert popped up in the Portal advising of a service incident and directing me to the Azure Service Health feature to view the full incident details and track it.
On the Azure Service Health page I could see an active incident, and I decided to try out the alerting feature so I could track it during my commute home.
I clicked on the Add Alert option and configured a new email-based alert. You can also push alerts into your preferred IT Service Management (ITSM) solution. We aren't yet using an ITSM platform for this solution, but this would be our choice in future!
In Services I chose Service Bus and Event Hubs, and for Regions I selected the two Australian Regions. Note that I had to set up an Action Group as I hadn't used the feature previously. In the screenshot below I am just reusing the one I had previously set up.
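For anyone who prefers to script this rather than click through the Portal, the configuration above maps roughly onto an activity log alert resource with a Service Health condition. The following is a minimal Python sketch that only builds the resource body as a dictionary and prints it; it does not call Azure. The field paths reflect my understanding of the shape used in Service Health alert templates, and the alert name, subscription ID, Action Group ID and the two region names are placeholder assumptions, so check everything against the current documentation before deploying.

```python
import json


def build_service_health_alert(name, subscription_id, action_group_id, services, regions):
    """Build an activity log alert resource body (as a dict) for Service Health events.

    Assumption: the condition field paths below match the Service Health
    alert schema; verify them against the current Azure documentation.
    """
    conditions = [
        # Only match Service Health entries in the activity log.
        {"field": "category", "equals": "ServiceHealth"},
        # Restrict to the services we care about.
        {
            "field": "properties.impactedServices[*].ServiceName",
            "containsAny": list(services),
        },
        # Restrict to the regions we care about.
        {
            "field": "properties.impactedServices[*].ImpactedRegions[*].RegionName",
            "containsAny": list(regions),
        },
    ]
    return {
        "type": "Microsoft.Insights/activityLogAlerts",
        "name": name,
        "location": "Global",
        "properties": {
            "enabled": True,
            "scopes": [f"/subscriptions/{subscription_id}"],
            "condition": {"allOf": conditions},
            "actions": {"actionGroups": [{"actionGroupId": action_group_id}]},
        },
    }


# All values below are hypothetical placeholders, not taken from the post.
alert = build_service_health_alert(
    name="messaging-service-health",
    subscription_id="00000000-0000-0000-0000-000000000000",
    action_group_id="<action-group-resource-id>",
    services=["Service Bus", "Event Hubs"],
    regions=["Australia East", "Australia Southeast"],
)

print(json.dumps(alert, indent=2))
```

The printed JSON could then be dropped into the `resources` array of a deployment template, with an API version added, so the alert lives in source control alongside the rest of the solution.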
A short while after I saved the Alert configuration, the recipients in the Action Group started to receive update emails containing the most recent status of the incident. A sample is shown below.
About 45 minutes after this alert we received a resolution notification.
The amount of time this simple setup saved our team is pretty amazing. If you're not using this feature already, you should go and explore it in the Portal and set it up for your key Azure components.
What a great early Christmas present!
😎