Innovation, Technology, and Life in the Cloud

George Watt

Prepare for More Cloud Outages

A recent outage at Amazon, which was caused by a network device failure, has reignited discussion and debate about the reliability of cloud computing services. While I believe that cloud services can be as reliable as any enterprise on-premise services – in fact they can be much more resilient than many cloud consumers could ever provide on their own – I believe this is likely the first of many more such incidents to come. Confused? Here’s why…

When many cloud services were first introduced, they were deployed using shiny, brand-new equipment for everything from storage to servers to networks. Essentially, cloud teams enjoyed the advantage of almost any new endeavor: they could do new things in new ways without any baggage or legacy constraints. And as cloud services became more popular (some would argue mainstream), there was more investment.

New equipment tends not to break or malfunction, and today’s new equipment – especially the type of equipment used to create cloud services – tends to stay healthy for a fairly long time. For years.

Now cloud computing is fairly mature and some of that equipment will start to show its age. Some of it will start to malfunction, and some will not be able to handle increased capacity. In some cases a simple thing, such as an OS or firmware upgrade or patch, might cause older equipment’s performance to degrade, possibly significantly. In essence, cloud providers will now have to deal with the same issues that more mature deployments have been dealing with for years.

Regardless of whether Amazon’s failed network device was an older one, as cloud computing begins to develop a patina, expect more events like it that are related to aging infrastructure.

The good news is, of course, that this risk can be mitigated. Many of the required disciplines and tools – asset management and refresh, predictive management (problem prevention), capacity planning, and reactive management and automation (minimizing problem impact and, in the best cases, making issues almost invisible) – have existed for a very long time. Some of these disciplines have evolved to better suit cloud environments, and good cloud providers can employ people who are best at these things.

As I have stated in the past, “not all clouds are created equal”. I have also cautioned, “resiliency is not a byproduct of cloud computing”. That is still the case. So do your homework and ask questions about your prospective provider’s strategy for things like hardware and software updates, and resilience in general.  The five key items I discussed in that article remain a good place to begin. What would you add to that list?

This blog is cross-posted at Cloud Storm Chasers. George is co-author of “The Innovative CIO.”
