How does a battery backup problem cause an outage? That has to be a secondary problem.
Here's what GNAX says about their power backup system:
How on earth does that kind of system lose power? A Navy SEAL team should have trouble taking that out! So what if a backup battery goes bad -- they're supposed to have redundant AC feeds into the building and multiple power generators on a redundant grid with two separate feeds into the building.
GNAX have released an official explanation for this; I'm sure GlowHost will update shortly. To cut a very, very long story short, it was a series of freak events. The main utility company supplying the datacentre failed, causing a complete outage; the backup generators kicked in and most things rebooted. However, both battery systems failed (the main and the redundant), causing the complete failure. Once things were back up, multiple hardware failures and router problems caused a catalog of problems which had to be fixed.
That is a very very short version of what happened!
bdominick -
Just wanted to add my two cents about dealing with your clients - especially as a fairly new provider:
Be confident that you are offering a quality service.
Be aware that mechanical devices can and will fail, regardless of the precautions we have in place.
If your clients go elsewhere, chances are they will return - especially if you have been forthright about the issues they originally experienced.
I've gone as far as "helping" my clients find another qualified service provider and have yet to lose a client when I've sent them the information. I just keep doing what I do - I'm confident that I am providing an awesome service.
Thanks a lot, Lynne. That's really helpful advice!
Hey Andy, where do you find notices from GNAX? I'd like to see the long version of this explanation.
From the short version, it sounds like GNAX has been deceiving its customers, which is very disconcerting. If you have 2 backups and they both fail, you actually had zero backups. You can't promote backup systems that aren't working. And how did the two grid connections both fail in the first place? Probably because there was only one...
GNAX has a forum.
I visited the datacenter, and it is pretty damn impressive! The guys I met are all real people.
However, if you read the long version you will see a long comedy of errors.
I can relate to their issue; it was a great embarrassment for them. When I visited they were so proud that they had it all figured out.
I have been working on a failover system for a client with 17 servers, so I can relate. The only way to test it is to flip the switch, which is hard to do when you are live. So you wait and pray. Then judgement day comes, and you find all the holes. And then you patch them and lick your wounds.
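Just to illustrate the kind of check a setup like that usually hinges on, here is a rough Python sketch of a watchdog that probes the primary box and fails over after a few missed probes. The hostname, port, and the promote_standby() step are made-up placeholders, not anything GNAX or GlowHost actually runs:

```python
import socket
import time

PRIMARY = ("primary.example.com", 80)   # hypothetical primary server
FAIL_THRESHOLD = 3                      # consecutive failed probes before failing over
PROBE_INTERVAL = 10                     # seconds between probes


def is_alive(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def promote_standby():
    """Placeholder: repoint DNS, a load balancer, or a shared IP at the standby box."""
    print("Primary looks dead -- promoting standby (stub).")


def watch():
    failures = 0
    while True:
        if is_alive(*PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= FAIL_THRESHOLD:
                promote_standby()
                break
        time.sleep(PROBE_INTERVAL)


if __name__ == "__main__":
    watch()
```

The painful part is exactly what you describe: promote_standby() only gets proven the day the primary really dies.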
However, when both battery units failed, they fired up the backup generators and much of the power came back online.
But when a computer goes down unexpectedly, bad things can happen to the file system, as I said, delaying the boot. When it is bad enough, your OS may get hosed.
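For what it's worth, this is why a lot of software writes files the careful way -- temp file, fsync, then rename -- so a power cut mid-write leaves either the old copy or the new one, never a half-written mess. A minimal Python sketch of that pattern (the file name is just an example):

```python
import os


def atomic_write(path, data: bytes):
    """Write data so a power cut mid-write leaves either the old file or the
    new one, never a truncated mix (temp file + fsync + rename pattern)."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # force the bytes onto disk before the rename
    os.replace(tmp, path)      # atomic swap on POSIX filesystems


# If power dies anywhere in here, config.dat is either the old version
# or the new version -- not a half-written file.
atomic_write("config.dat", b"new settings\n")
```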
When the power fluctuates on and off from load spikes, routers can lose configuration settings. Suddenly you have power, but data is not getting routed.
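One thing that helps after a mess like that is diffing every router's running config against the last known-good backup, so you catch the boxes that silently lost settings. A rough Python sketch -- fetch_running_config() is a hypothetical stub, since how you actually pull the config depends entirely on your gear:

```python
import difflib
from pathlib import Path


def fetch_running_config(router: str) -> str:
    """Hypothetical helper: pull the live config over SSH/SNMP/whatever
    your hardware supports. Stubbed out here."""
    raise NotImplementedError


def check_drift(router: str, baseline_file: str) -> bool:
    """Compare the running config to the last known-good copy and print a diff."""
    baseline = Path(baseline_file).read_text().splitlines()
    current = fetch_running_config(router).splitlines()
    diff = list(difflib.unified_diff(baseline, current,
                                     fromfile="baseline", tofile="running"))
    if diff:
        print(f"{router}: running config no longer matches baseline")
        print("\n".join(diff))
        return True
    return False


# After a messy power event, run this against every router, e.g.:
# for r in ["core-1", "core-2", "edge-1"]:
#     check_drift(r, f"backups/{r}.cfg")
```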
I am a moderator and a GlowHost client. I've done my share of ragging on GNAX. But I think we will see a much more stable DC after this. They called in every employee to deal with it. From my experience with hardware, this was probably pretty ugly for them.