Showing posts with label Outages. Show all posts

Monday, March 31, 2008

How many outages can a BlackBerry user stand?

Boy, how many more BlackBerry outages will it take before businesses start providing negative feedback to Research In Motion (RIM) in the form of cancellations? Your guess is as good as mine. But given that this weekend's outage makes three in the last year, RIM has some explaining to do to its customer base.

Tuesday, March 11, 2008

Anywhere services need redundancy, not single points of failure

I've got to chime in with John Gruber of Daring Fireball on this: No matter how much Research In Motion promotes the NOC approach to running its BlackBerry service, that NOC is still a single point of failure for all BlackBerry subscribers. And given that this weakness has been demonstrated to BlackBerry subscribers with two multi-hour outages in the last 11 months, at some point businesses are going to scream "Fix it!" I'm surprised someone from the high-availability computing world hasn't pilloried RIM already.

For those who don't think RIM's outages are any big deal, here's a fun fact. If RIM were trying to meet 99.999% availability for its BlackBerry service, which allows only a little over five minutes of downtime per year, the three-hour outage on February 11, 2008 would have used up its allowed downtime for the next 34 years. Oops.
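The arithmetic behind that figure can be checked with a short sketch. It assumes an average year of 365.25 days and treats the outage as exactly three hours; the function names are made up for illustration.

```python
# Downtime budget for an availability target, and how many years of that
# budget a single outage consumes. Assumes an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at the given availability (0 to 1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def years_of_budget_consumed(outage_minutes: float, availability: float) -> float:
    """Years of downtime allowance used up by one outage of the given length."""
    return outage_minutes / allowed_downtime_minutes(availability)


five_nines = 0.99999
budget = allowed_downtime_minutes(five_nines)
years = years_of_budget_consumed(3 * 60, five_nines)  # a three-hour outage

print(f"Allowed downtime at 99.999%: {budget:.2f} minutes per year")
print(f"A three-hour outage uses about {years:.0f} years of that budget")
```

At five nines the yearly allowance works out to a little over five minutes, so 180 minutes of downtime spends roughly 34 years of it.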

It's easy to forget that until the Internet came along and demonstrated that distributed, decentralized networks really were more reliable, most of the major computer companies built networks around centralized command-and-control systems, and those networks never achieved anything like the resilience of the Internet. It's a shame that Anywhere business customers using BlackBerrys are going to have to learn that lesson again the hard way. It's not a question of if that will happen; it's a question of when.

Meanwhile, everyone who wants to avoid the Anywhere school of hard knocks should repeat after me:

I will not accept single points of failure in my Anywhere service.
I will not accept single points of failure in my Anywhere service.
I will not accept single points of failure in my Anywhere service.

Wednesday, February 20, 2008

Amazon's service failure provides cautionary Anywhere lessons

Saskatchewan Shelf Cloud (Credit: Jeff Kerr and apod.nasa.gov)

For those who started their long weekend early last week, Amazon's storage 'cloud' service was offline for about three hours on Friday. When I combine this with the similarly long BlackBerry outage earlier in the week, I think there are some lessons worth noting:

  • Outages that seem important to you aren't important to service providers. Too many people assume that by outsourcing their technology challenges, they'll be getting world-class service and risk management in return. Based on the quotes of Amazon and Research In Motion executives, those assumptions are misplaced. RIM co-CEO Jim Balsillie dismissed the BlackBerry outage as "an intermittent delay, a couple of hours. It's old news." Amazon at least admitted that the downtime was unacceptable, but only did so after customers had spent hours searching for the cause of the problem.
  • Cloud services don't guarantee anything. No matter how good those service level agreements sound when you sign them, when the service is down, you're down as well. And if you look very carefully at most of those service level agreements, the penalties for not providing the service are limited to what you are paying that month. That's cold comfort when your business's revenue goes to zero for an unknown period of time.
  • Anywhere services need more than commodity service. Many Web 2.0 startups have staked their future on the hope that cloud computing is "good enough" to propel their business models. But the more consumers get used to Anywhere services -- ones that anyone can use on any device on any network -- the more they will be disappointed by garden-variety, commodity service. Those companies aspiring to be the next Google should remember that Google started out by building its own massively redundant infrastructure in closets at Stanford University rather than just piggybacking on university resources. Anywhere reliability and scale will require more than formless cloud infrastructures to work 24 hours a day, seven days a week.

One final note: one of the companies I already consider to be a great Anywhere company is FedEx. While some may argue that it isn't in the Anywhere information business, many of its executives would disagree strenuously, noting that the information they collect on packages and deliveries is just as valuable as the packages themselves. I remember one of the CIOs of FedEx commenting, "Our data center is a lot like Noah's Ark: we have two of everything." And the company's circa 1996 thinking about contingency planning and reliability of service, as documented by Wired Magazine, is a great example for companies today to consider:

Behind one of these straitlaced corporate citadels, a low-slung building squats buried under a vine-covered earthworks, shielded by walls of thick concrete. Formally known as the Global Operations Center, it serves as a subterranean command facility for the entire FedEx distribution and delivery system. Employees call it "the Bunker."

The lighting in the Bunker is subdued, and a hushed intensity crackles through the climate-controlled air. On the walls, giant flat-panel projection screens display real-time weather maps of the continental United States, while workstations around the periphery stand equipped with banks of computer terminals and heavy black telephones. A team in the back of the room specializes in domestic operations, and another behind it focuses on surface transportation. Up front is the international unit; a bevy of flight crew dispatchers are positioned off to the left, and there's a handful of meteorologists tucked off in a dark corner.

"It's pretty quiet here now," explains Bunker manager Pete Gwaltney. "But come midnight, the place will be a whole lot busier. At peak periods, we operate in five-minute decision cycles."

Gwaltney's job is to keep the FedEx distribution network running smoothly despite the inevitable grind of glitches and failures that plague any complex mechanical system. But as he nonchalantly puts it, "This company spends lots of money preparing for contingencies."

To demonstrate the point, he explains how FedEx launches an empty jet freighter each night from Portland, Oregon, bound for Memphis. The jet tracks a course that brings it close to several FedEx terminal airports so that if one of the jets parked on the ground suffers a sudden mechanical failure, the empty freighter can swoop down and pick up the stricken plane's cargo.

The image of that empty FedEx jet streaking through the night reminds me of the old "doomsday" bombers that were kept aloft and on alert during the Cold War. "Jeez," I remark. "It's like Strategic Air Command around here." Gwaltney smiles, as if the same thought crossed his mind a long, long time ago. "Actually," he says, "it's more like Strategic Freight Command."

That's what I think of as the gold standard for Anywhere services. And for those companies who think they can bet their futures and investors' money on cloud-based, best-effort services and compete with companies that think like FedEx, good luck with that. You'll need it.


Monday, February 11, 2008

BlackBerry Service down throughout North America -- again

According to the Huffington Post, BlackBerry service is out throughout most of North America. This is the second outage in the last 10 months and, in my opinion, reflects a fundamental flaw in Research In Motion's (RIM's) service strategy. If you want a service to work reliably Anywhere, you can't run every single network connection through a single point of failure. Yet RIM's network architecture routes every single BlackBerry connection through its Network Operations Center in Waterloo. People may claim that Apple's iPhones aren't ready for enterprise use, but at least they don't all stop working when Cupertino has a problem. One big outage is an accident. Two is a design problem. And it's time RIM started fixing its design problem.
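To put rough numbers on why routing everything through one operations center hurts, here is a minimal sketch of the textbook series/parallel availability model. The availability figures are hypothetical, chosen only for illustration (they are not RIM's or any carrier's actual numbers), and the model assumes components fail independently, which real outages often violate.

```python
from math import prod


def series_availability(components):
    """All components must be up at once, so their availabilities multiply."""
    return prod(components)


def parallel_availability(replicas):
    """Up if at least one replica is up (assumes independent failures)."""
    return 1.0 - prod(1.0 - a for a in replicas)


# Hypothetical figures for illustration only.
carrier_network = 0.999
operations_center = 0.999

# Every connection passes through the carrier AND a single operations center.
one_center = series_availability([carrier_network, operations_center])

# Same path, but with two independent operations centers to choose from.
two_centers = series_availability(
    [carrier_network, parallel_availability([operations_center, operations_center])]
)

print(f"One operations center:  {one_center:.6f}")
print(f"Two operations centers: {two_centers:.6f}")
```

In this toy model the redundant pair contributes almost no downtime of its own, so the end-to-end figure is limited by the carrier rather than by the operations center.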