Innovation, Technology, and Life in the Cloud

George Watt

Pragmatic Cloud: Cloudy Views Bring Sunny Outlook for Cloud Consumers

“#CloudViews” Cloud Outage Chat Participants Put Their Customers First

Last Thursday I participated as a panelist in Cloud Commons’ “#CloudViews” Twitter chat (partial session archive here or page through the full archive here).  The following is a brief summary of that event.

Put the Customers’ Interests First

Though the topic of this chat session was “Cloud Outages”, there was, I believe, another clear theme:  It’s all about the consumer.  It’s all about the customer.  And the participants care about the well-being of the businesses to which they provide service.  Whilst this was demonstrated in a somewhat subtle way in numerous posts, some of them were quite straightforward.

Transparency is Paramount

Closely connected to the underlying theme of respect for our customers was a very active discussion regarding the transparency of providers when service is disrupted.  Participants weighed in from both customer and provider perspectives.  For example, consider this excerpt from an exchange started by Jonathan Davis of DNS Europe, who offered his opinion on the service provider perspective:

Jay Fry’s comment resulted in much agreement and was widely reposted.  Christoph Streit of ScaleUp agreed:

And this exchange from the customer’s perspective generated much agreement, including from the service provider community in attendance, as can be seen through responses from Jonathan and from Mimecast’s Justin Pirie.

(This topic produced much conversation.  Posts were too numerous to include all of them.  I apologize to those whom I omitted.)

So, it was incredibly encouraging to see so much agreement on the importance of best practices, customer focus, and ethical conduct.

Built-in and Built-On Resilience

Yes, there was also discussion of service outages and resilience – and a lot of it.  There were many good perspectives on how providers, application architects, and consumers can deliver resilience.  I believe there was also nearly unanimous agreement that components can and will fail, and that services must be architected to address that.  (Please visit the chat archive for other examples.)

I have attempted to extract a representative sampling of key points made throughout the discussion and share it via the list below.  Before I share that I would like to answer a question asked by my colleague, Andi Mann, during the session that I missed as posts flew past.  (Apologies for not catching that, Andi.)  In response to one of my posts that stated resilience can be “built-in” to the cloud platform or “built-on” via the application or service Andi asked:

When I referred to “built-in” resilience I was referring to the things that the service providers have added to their services in order to ensure that their customers experience no loss of service when a component fails.  The providers who joined the session discussed many of these things, such as N+1 environments, clustering, and geographically dispersed data centers.

As we have witnessed recently, even when such precautions are taken a service can suffer an outage.  There are many reasons this can happen, ranging from a new type of issue for which the provider was not prepared to cases where, through no fault of the provider (their service remains active), the customer (composite application…) is unable to connect to the service.  To ensure that services are not disrupted even in these cases (to make sure nobody notices), application architects are building cloud-savvy resilience into the applications themselves.  This is what I referred to as “built-on”, since it sits “on top of” any resilience “built-in” by the service providers and adds another layer of protection.  Netflix’ “Rambo Architecture” and its use of “Chaos Monkeys” are good examples of this.
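To make the “built-on” idea a little more concrete, here is a minimal sketch in Python of an application that fails over from one provider endpoint to another when the first cannot be reached.  It is an illustration only, not how Netflix or any chat participant implements resilience.  The names `ServiceUnavailable`, `call_with_failover`, `primary_region`, and `secondary_region` are hypothetical, and the outage is simulated.

```python
from typing import Callable, Sequence


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a provider endpoint cannot be reached."""


def call_with_failover(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first successful response."""
    failures = 0
    for provider in providers:
        try:
            return provider()
        except ServiceUnavailable:
            failures += 1  # note the failure and move on to the next provider
    raise ServiceUnavailable(f"all {failures} providers failed")


def primary_region() -> str:
    # Simulated outage: the application cannot connect to its usual endpoint.
    raise ServiceUnavailable("primary region unreachable")


def secondary_region() -> str:
    # Assumed standby endpoint, for example in another data center.
    return "served from secondary region"


result = call_with_failover([primary_region, secondary_region])
print(result)
```

The caller never sees the failure of the first endpoint, which is the point of adding a layer “on top of” whatever the provider has built in.  A real application would also need to decide how long to wait, how often to retry, and how to keep data consistent between locations.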

The tweet chat panelists shared and discussed many great tips and lessons learned.  While approaches to specific issues were different at times, generally there was broad agreement in many areas.  Participants tended to agree on the following:

  • Components and services will fail
  • Not all failures are predictable or preventable
  • Consumers must be prepared for outages in both “traditional” on-premise environments and in clouds
  • It is possible to provide continuous service even when a service fails
  • Many service providers are aware of the importance of resilience and are taking action to provide it
  • Resilience can and should be built into both the cloud service (“platform”) and the business application (“built-in” and “built-on”)
  • A poorly designed application when moved to the cloud is still a poorly designed application
  • Many traditional on-premise applications are not built to “expect” failure and that can add complexity when moving those applications to the cloud
  • Not all services require the same level of resilience
  • Providers and consumers must work together to evaluate risk and determine the appropriate level of resilience for their services
  • Consumers may need assistance to determine the criticality of their applications, the risk they can tolerate, their best approach to resilience, and balancing that with cost
  • Assume components and services will fail, plan for that, test the plan
  • Test the plan at an appropriate frequency
  • Did we mention test the plan?
  • Services and environments change, so keep plans fresh (and yes, test them)
  • Failures and outages should not be hidden from consumers/customers
  • Transparency regarding outages is key: Customers should be informed and providers should communicate proactively
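The “assume failure, plan for it, test the plan” points in the list above can be sketched as a small failure drill.  This is a rough illustration in Python under stated assumptions, not a description of any participant’s tooling: `inject_failures`, `call_with_fallback`, and `run_drill` are hypothetical names, the “service” is a stand-in function, and failures are injected at random rather than observed.

```python
import random
from typing import Callable, Dict


def inject_failures(service: Callable[[], str], failure_rate: float,
                    rng: random.Random) -> Callable[[], str]:
    """Wrap a service so that a fraction of calls fail, to rehearse outages."""
    def flaky() -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return service()
    return flaky


def call_with_fallback(service: Callable[[], str], fallback: Callable[[], str],
                       attempts: int = 3) -> str:
    """The plan under test: retry a few times, then use a fallback answer."""
    for _ in range(attempts):
        try:
            return service()
        except ConnectionError:
            continue
    return fallback()


def run_drill(failure_rate: float, calls: int = 1000, seed: int = 0) -> Dict[str, int]:
    """Exercise the plan and count how each request was ultimately served."""
    rng = random.Random(seed)
    flaky = inject_failures(lambda: "live", failure_rate, rng)
    counts = {"live": 0, "cached": 0}
    for _ in range(calls):
        counts[call_with_fallback(flaky, lambda: "cached")] += 1
    return counts


drill = run_drill(failure_rate=0.5)
print(drill)
```

Every request in the drill receives an answer, either live or from the fallback, even though half of the underlying calls are made to fail.  Running a drill like this on a schedule is one simple way to keep a plan fresh as services and environments change.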

In addition to these items, several tips were shared such as this one:

I quite enjoyed the session and was very pleased with the level of active participation, with the great information that was shared, and with the level of respect the participants offered one another, even when their views were different.  So I would like to offer a sincere thank you to the chat participants.  If I missed something important please do let me know.

To all who were kind enough to read this:  What other words of wisdom would you offer regarding cloud outages?  We would also greatly appreciate suggestions for topics for future chat sessions.

This blog is cross-posted at Cloud Storm Chasers.


Information

This entry was posted on September 19, 2011 in Cloud, Resilience.

About

This site contains articles regarding the practical aspects of deploying, providing, managing, and using cloud computing technologies; though much of the information is applicable to most information technologies. I also share my thoughts and experiences related to innovation, consumer driven IT, social media, management issues, and about what some refer to as “soft skills”.

All works copyright (C) 2009 - 2014 George Watt - All rights reserved.
