There are some problems with the network at the datacenter.
Our main server was down for a while, which is why we were unable to post anything here.
Some servers are still down and we're contacting the NOC technicians regarding this. Updates will follow.
Looks like all GH servers are back up as of a minute or so ago. Can you tell me how often this kind of thing happens, and how long it was down total (the network monitor only goes back a couple of hours)? Looks like we were down for longer than GNAX's promised 99.999% uptime guarantee. Are there repercussions to them for that? I had 3 clients call me about their sites. One of them is very new and sounded very pissed.
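Just for perspective (rough numbers of my own, nothing official from GH or GNAX), here's what 99.999% actually allows:

Code:
# Rough sketch (my own figures, not GH's or GNAX's): how much downtime a
# given uptime guarantee actually permits.

MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12

def allowed_downtime(uptime_pct, period_minutes):
    """Minutes of downtime permitted by an uptime percentage over a period."""
    return period_minutes * (1 - uptime_pct / 100)

print(allowed_downtime(99.999, MINUTES_PER_YEAR))   # ~5.3 minutes per year
print(allowed_downtime(99.999, MINUTES_PER_MONTH))  # ~0.4 minutes per month
# An outage measured in hours blows well past either figure.

So even if the monitor only shows a couple of hours of it, we're already way past the allowance.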
Yes, all servers are up now. If you are still facing any problems with your site, please contact our support team.
bdominick: I'm not sure about the refunds so please ask Matt about that.
Well, GNAX didn't guarantee me anything directly, and I think GH's guarantee is a little more sensible. I'm not looking for a refund, I just hope you guys stick it to them on our behalf. It looks like there might have been some kind of power outage -- I can tell my server rebooted. To read GNAX's documentation, you'd think that was impossible...
FYI - I read a notice from GNAX a week ago that Level 3 would be servicing a router this morning. Probably had a lot to do with it.
Welcome back GH. Everyone has to understand this was a problem with the GNAX network and as such out of GH's hands. However, I would like to add some constructive criticism (below). Any word from GNAX as to why a failure of this magnitude happened? It took nearly 5 hours for the problems to be rectified!
Constructive criticism:
Please, please, GH, could you get a hosting account in another datacentre which would host an emergency blog during such a situation? Just a thought!
Yeah, but since we GH clients cannot directly hold GNAX accountable, I do think it's within our right to expect GH to do that, by whatever means available. My clients don't give a **** if it's GNAX or GH or GOD to blame -- they are complaining to me. I can pass the buck all I want, but if my clients leave my company, I'm the one who pays, literally, so I want to know that GNAX gets hell for this kind of thing. I would contact them directly, but they don't know me from the next guy, and I don't write them checks.
I agree entirely. I think someone has to take this up with GNAX, and that someone is the person that pays GNAX's bills: GH.
I hope your clients don't leave, I really do feel for you. I got off lightly, as my server just blipped off then straight back on, so it wasn't down for more than about 5 minutes. However, as I said, 5 hours is a crazy amount of time. I would have been tearing my hair out, especially as the main GH site was also down, so information was limited.
I don't think anyone's going to leave me. I just point them to GNAX and tell them they're a huge datacenter and it could happen to anyone, etc, etc. I mean, anywhere they go, presumably, this COULD happen. It's just that my whole business is new, so all of my clients are somewhat new to me, and they're worried this might happen often. But I went through this forum and I can see it's only happened a few times over the years, and doesn't seem to have happened at all in 2007, so I feel reassured and I can in turn reassure them. I'm just glad I didn't give them a 99.999% uptime guarantee! Truth be told, I'm more likely to cause a long-term outage than GNAX is, hahaha.
Apparently, the last I heard, it was a battery backup unit that went bad.
I have 4 servers on that same unit. One cost me the whole morning to repair.
I am not sure that it is even a GNAX problem. Which server was down 5 hours?
One big problem is that cPanel uses the ext3 filesystem instead of XFS or some other high-reliability filesystem.
Depending on the situation, some of my servers have taken HOURS to come back online - think back to powering off your PC by accident and DOS or Windows running scandisk for a week!
With many Linux filesystems, if power is unexpectedly lost and invalid directory entries are detected, fsck runs at startup to rebuild the filesystem and check for errors.
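For the curious, here's a rough sketch (assuming a stock ext2/ext3 partition; /dev/sda1 below is just a placeholder, and it needs root plus e2fsprogs installed) of how you can peek at the metadata fsck uses when deciding whether a full check runs at boot:

Code:
# Sketch only: read the ext2/ext3 superblock fields that influence whether
# a full fsck runs at the next boot. /dev/sda1 is a placeholder device.
import subprocess

def ext3_check_status(device="/dev/sda1"):
    out = subprocess.run(["tune2fs", "-l", device],
                         capture_output=True, text=True, check=True).stdout
    info = {}
    for line in out.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return {
        "state": info.get("Filesystem state"),          # "clean" vs "not clean"
        "mount_count": info.get("Mount count"),
        "max_mount_count": info.get("Maximum mount count"),
        "last_checked": info.get("Last checked"),
    }

if __name__ == "__main__":
    print(ext3_check_status())

If the state isn't "clean" after a power loss, the boot-time fsck takes as long as it needs - which on a big, busy partition is exactly those HOURS I mentioned.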
How does a battery backup problem cause an outage? That has to be a secondary problem.
Here's what GNAX says about their power backup system:

Quote:
multiple power generators on a redundant grid with two separate feeds into the building.

How on earth does that kind of system lose power? A Navy SEAL team should have trouble taking that out! So what if a backup battery goes bad -- they're supposed to have redundant AC feeds into the building and multiple generators...
GNAX have released an official explanation for this; I'm sure GlowHost will update shortly. To cut a very, very long story short, it was a series of freak events. A main utility company supplying the datacentre failed, causing a complete outage; the backup generators kicked in and most things rebooted. However, both battery systems failed (the main and the redundant), causing the complete failure. Once running again, multiple hardware failures and router problems caused a catalog of problems which had to be fixed.
That is a very very short version of what happened!
bdominick -
Just wanted to add my two cents about dealing with your clients - especially as a fairly new provider:
Be confident that you are offering a quality service.
Be of the opinion that mechanical devices can and will fail, regardless of the precautions we have in place.
If your clients go elsewhere, chances are they will return - especially if you have been forthright about the issues they originally experienced.
I've gone as far as "helping" my clients find another qualified service provider and have yet to lose a client when I've sent them the information. I just keep doing what I do - I'm confident that I am providing an awesome service. :)
Thanks a lot, Lynne. That's really helpful advice!
Hey Andy, where do you find notices from GNAX? I'd like to see the long version of this explanation.
From the short version, it sounds like GNAX has been deceiving its customers, which is very disconcerting. If you have 2 backups and they both fail, you actually had zero backups. You can't promote backup systems that aren't working. And how did the two grid connections both fail in the first place? Probably because there was only one...
GNAX has a forum.
I visited the datacenter, and it is pretty damn impressive! The guys I met are all real people.
However, if you read the long version you will see the whole comedy of errors.
I can relate to their issue; it was a great embarrassment for them. When I visited they were so proud that they had it all figured out.
I have been working on a failover system for a client with 17 servers. I can relate. The only way to test it is to flip the switch, which is hard to do when you are live. So you wait and pray. Then judgement day comes, and you find all the loopholes. And then you patch them and lick your wounds.
However, when both battery units failed, they fired up the backup generator and much of the power came back online.
But... when a computer goes down unexpectedly, bad things can happen to the file system as I said, delaying the boot. When it is bad enough, your OS may get hosed.
When the power fluctuates on and off from load spikes, routers can lose configuration settings. Suddenly you have power but data is not getting routed.
I am a moderator and a GlowHost client. I've done my share of ragging on GNAX. But I think we will see a much more stable DC after this. They had every employee called in to deal with this. From my experience with hardware, this was probably pretty ugly for them.