Gone in 60 Seconds – Cloud Load Balancing Timeouts

Update: July 2014: AWS announced that customers can now manage ELB timeouts.  Read more here: https://aws.amazon.com/blogs/aws/elb-idle-timeout-control/

After what seems like ages, we finally got to the bottom of the weird issues we were having with a load-balanced instance of Umbraco hosted at Amazon Web Services.  Despite having followed Umbraco’s load balancing guide and tested extensively, we were seeing some weird behaviours for multi-page publishing and node copy operations.  Multiple threads were created for each request, which led to unexpected outcomes for users.

Eventually we managed to isolate this to the default (and unchangeable at time of writing) 60 second timeout behaviour of AWS’ ELB infrastructure combined with the long-running nature of the POST requests from Umbraco.  The easiest way to see the real side effect of the 60 second timeout is to put a debugging proxy like Fiddler or Charles between your browser and the ELB.  What we saw is below.

[Image: gsixty, a debugging proxy trace with the publish.aspx call highlighted in a red square]

So, you can see the culprit right there in the red square – the call to the publish.aspx page is terminated at 60 seconds by the ELB which causes the browser to resubmit it – Ouch!  This also occurs when you copy or move nodes and the process exceeds 60 seconds – you get multiple nodes!

To be clear – this is not a problem that is isolated to Umbraco – there is a lot of software that relies on long-running HTTP POST operations with the expectation that they will run to completion.

Now there are probably a range of reasons why AWS has this restriction.  The forum posts (dating back to 2009) don’t enlighten, but it’s not hard to see why: in an “elastic” environment anything that takes a long time to complete may be a bad thing (you can’t “scale up” your ELB if it’s still processing a batch of long-running requests).  I can see the logic to this restriction, as it simplifies the problems the AWS engineers need to solve, but it does introduce a limitation that isn’t covered clearly enough in any official AWS documentation.

The real solution here has to come from better software design that takes this limitation of the infrastructure into account.  Patterns like Post-Redirect-Get help: submit a short POST request to initiate the process on the server, redirect to another page, and then use async calls from the browser to check on the status of the process.
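To make that pattern concrete, here is a minimal sketch in Python (not Umbraco code, and not how Umbraco itself works).  It assumes an in-memory job registry and a hypothetical slow_publish function standing in for the long-running operation.  In a real web tier the POST handler would call submit() and immediately redirect to a status page, and the browser's async calls would hit something that returns status().

```python
import threading
import time
import uuid


class JobRegistry:
    """Tracks long-running jobs so each HTTP request can return quickly."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}  # job_id -> {"state", "result", "error"}

    def submit(self, work, *args):
        """Start the work on a background thread and return a job id at once.

        This is what a short POST handler would call before redirecting.
        """
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"state": "running", "result": None, "error": None}
        thread = threading.Thread(
            target=self._run, args=(job_id, work, args), daemon=True
        )
        thread.start()
        return job_id

    def _run(self, job_id, work, args):
        try:
            update = {"state": "done", "result": work(*args), "error": None}
        except Exception as exc:
            update = {"state": "failed", "result": None, "error": str(exc)}
        with self._lock:
            self._jobs[job_id] = update

    def status(self, job_id):
        """Return a copy of the job record, or None for an unknown id.

        This is what the browser's async status checks would read.
        """
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job is not None else None


def slow_publish(page_count):
    """Hypothetical stand-in for a long-running publish operation."""
    time.sleep(0.2)
    return f"published {page_count} pages"


def wait_for(registry, job_id, poll_interval=0.05, timeout=5.0):
    """Poll the registry the way a browser would, until the job settles."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = registry.status(job_id)
        if current is None or current["state"] != "running":
            return current
        time.sleep(poll_interval)
    return registry.status(job_id)
```

The point of the design is that no single request stays open for anything like 60 seconds: the submit call returns in milliseconds and each status check is a short GET, so the load balancer never has a reason to cut the connection and the browser never resubmits.  A production version would need durable job storage shared across nodes rather than a per-process dictionary.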

Yes, I know, we could probably run our own instances with HAProxy on them, but why build more infrastructure to manage when what’s there is perfectly fit for purpose?

Updated – You Have An Alternative

10 September – I’ve been lucky enough to be attending the first AWS Architecture course run by Amazon here in Sydney and the news on this front is interesting.  By default you get 60 seconds, *but* you can request (via your AWS Account Manager or Architect) that this timeout be increased up to a maximum of 17 minutes.  This is applied on a per-ELB basis, so if you create more ELB instances you would need to make the same request to AWS for each one.

My advice: fix your application before you ask for a non-standard ELB setup.
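If a longer timeout is still needed after that, the July 2014 update noted at the top of this post means you can now set it yourself.  The sketch below assumes a classic ELB and the boto3 SDK, whose "elb" client exposes modify_load_balancer_attributes; the helper name, the load balancer name and the 300 second value are all placeholders of mine, not anything from AWS.  The client is passed in so the helper itself has no AWS dependency.

```python
def set_elb_idle_timeout(elb_client, load_balancer_name, seconds):
    """Set the idle timeout (in seconds) on one classic ELB.

    elb_client is expected to behave like boto3.client("elb").
    The setting is per load balancer, so call this once for each ELB.
    """
    if not isinstance(seconds, int) or seconds < 1:
        raise ValueError("idle timeout must be a positive whole number of seconds")
    return elb_client.modify_load_balancer_attributes(
        LoadBalancerName=load_balancer_name,
        LoadBalancerAttributes={
            "ConnectionSettings": {"IdleTimeout": seconds},
        },
    )


# Example usage (requires boto3 and AWS credentials; names are placeholders):
#
#   import boto3
#   elb = boto3.client("elb")
#   set_elb_idle_timeout(elb, "my-umbraco-elb", 300)
```

Check the current AWS documentation for the permitted range before picking a value, and remember that a longer timeout only hides the long-running POST rather than fixing it.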

Not Just For AWS ELB

Now, chaps (and ladies), you also need to be aware that this issue will raise its head in Windows Azure as well, but most likely after a longer duration.  A very obliquely written blog post on MSDN suggests the timeout will now be based on the duration AND the number of concurrent connections you have.

You have been warned!

9 thoughts on “Gone in 60 Seconds – Cloud Load Balancing Timeouts”

  1. x30n says:

    This explains my pain… thank you! and no thank you to AWS! Some queries are just big arsed long running queries… so maybe i go add several processors to the SQL side and see if i can get the time down.

    • Simon says:

      If you have the ability to, you can also look at modifying your web tier to use a different pattern when making POST requests that are likely to exceed the timeout. If you’re using something like SSRS then you may be out of luck though!

  2. jp says:

    It seems like the “Umbraco Administration” point here would be a solution as well: going directly to a single server for admin. Specifically going to that admin server not through the ELB. Would that be a potential resolution? (Asking as someone who is looking at implementing a scalable Umbraco installation using Amazon.)

    Of course, if what you’re trying to load-balance is the Umbraco back-end then this suggestion is moot. But I’m worried about scaling the front-end, not the Umbraco administration.

    • Simon says:

      JP, it’s a valid approach to feed just a single address to your CMS users (we are actually doing this for our main Umbraco/AWS customer). You just need to make sure that you have hostnames set up appropriately for all your nodes and that you consider mapping elastic IPs to those hosts so that a restart / stop of the Instance doesn’t mean you break any DNS aliases you have set up.

      Also note that you’ll need to enable the distributed publishing features of Umbraco in order to support multiple CMS servers – this still applies even if you are only going to provide users access to a single CMS instance. If you don’t do this and you have the local XML caching features enabled then you will find one or more of your nodes will not be updated. One gotcha here – you must include the local server in the distributed publish as once enabled Umbraco will only publish via the distributed call interface (even on the local host).

      If you have any questions let me know and I can email you.

      • Thanks for the answers Simon and the tips! So the single administration instance doesn’t have an ELB long POST issue – great to know.

        I will absolutely take you up on email once I start in on the work – very kind of you. Ping me via email with your address.

  3. JP says:

    How did you setup your file sharing? One of the servers simply had the files and the others pointed at that one?

    • Simon says:

      We wrote an extension for Umbraco to push uploaded media to an S3 bucket which we then use in our templates when building URLs to media. The benefit with this is you could expose the media via CloudFront for better distributed performance.

      Alternatively you could setup using VPC and run Active Directory to give you support for DFS between the nodes.

      • I should have been more specific: I’m wondering about the files like the Examine Indices (Indexes?), XML Cache file, and the like.

        S3 is definitely in the plan for media! Does your plugin tie into the Media Publish event?

      • Simon says:

        You need to enable distributed publishing in Umbraco to update the cache across multiple nodes. This is held in the umbracoSettings.config file and should have all the nodes listed.

        Our S3 extension for media is exposed in the CMS as a custom data type and editor that is used for managing media uploads.
