
Thread: A big Power Blip at GNAX Datacenter

  1. #1
    Matt (GlowHost Administrator)
    Join Date: Jan 2005
    Location: Behind your monitor
    Posts: 6,206

    A big Power Blip at GNAX Datacenter

    We still do not have official news as to what happened, why it happened, and what will be done to prevent it in the future.

    What we do know is that there was a power outage at GNAX that lasted about 5 minutes. Unfortunately, when this happens it causes hard reboots on all of the affected devices. While 95% of the devices were otherwise unaffected (apart from the reboot), some of them have gone into fsck, the check Linux runs to make sure the filesystems are not damaged, repairing them if they are.

    Right now Vern and Ratbite are the only 2 remaining servers in this status.

    If you have a VPS, you are on Vern, which is running a disk check. Vern was also scheduled for another upgrade this week, and while the machine is down we may take advantage of the downtime and install the upgrades this evening. We have to see what the workload looks like and what sort of damage we are looking at on the disks for this unit.

    We have techs looking at both devices right now and will have them back online as soon as possible.
    Send your friends and site visitors to GlowHost and get $125 plus bonus!
    GlowHost Affiliate Program | Read our Blog | Follow us on X |

  2. #2
    jmarcv (Cranky Coder)
    Join Date: Jan 2005
    Posts: 354

    From GNAX

    5/20 Power Outage RFO
    Severe storm cells came through the North Georgia region this evening. AtlantaNAP experienced an over current fault outage on one of our 2 main feeds. The feed is the original feed, which has the most load currently connected to it. The more systems connected to a feed, the more lightning and over current will try to pass into it, i.e. if you don't have very much load on it (our new feed is currently only at 1/6th load), then current does not try to flow to it very much. Our first system is currently at 65% load, so it tried to absorb much more of the lightning strike than the other one, hence the main breaker going into over current fault.

    I have spoken with all of our key electrical engineers associated with the building at this point. According to Georgia Power and our PSSI and Cummins engineers, we likely took a lightning strike to the utility very near the facility, which caused an over current fault on the main incoming breaker on our first set of switchgear. The breaker is designed to trip in the event of this kind of fault to protect the gear (your computers) inside the building from being burned up by the lightning strike.

    When this type of fault happens, the computer will not start the generators until an engineer verifies where the fault is. This is because a fault inside the wiring plant, such as a main short from a damaged feeder wire carrying main current in the building, could also cause this kind of over current.
    In that case it would be very dangerous to turn the power back on manually, or to force a manual start of the gen sets and push current to the system with a fault remaining. Lives and machinery could be lost.

    We dispatched several of our staff to visually inspect for faults (we did not want to turn something on and have it fry everyone's gear), found none, verified it was likely a lightning strike, and manually started the generators to restore power. Unfortunately the UPS system is only designed to carry that load for 10 minutes, which was not enough time for us to safely verify and do a manual start.
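
    As a rough illustration of the sequence described above (not GNAX's actual control system), the hold-for-inspection rule and the UPS time budget could be sketched like this. The function names and the 15-minute inspection time are hypothetical; only the 10-minute UPS figure comes from the RFO.

```python
from enum import Enum


class GeneratorAction(Enum):
    AUTO_START = "auto_start"
    HOLD_FOR_INSPECTION = "hold_for_inspection"


def generator_action(overcurrent_fault: bool) -> GeneratorAction:
    """Decide what happens after utility power is lost on a feed."""
    # An over current trip could also mean a short inside the building,
    # so the generators wait for an engineer instead of energizing the feed.
    if overcurrent_fault:
        return GeneratorAction.HOLD_FOR_INSPECTION
    return GeneratorAction.AUTO_START


def load_dropped(ups_runtime_min: float, minutes_until_power: float) -> bool:
    """True if the UPS runs out before power is restored."""
    return minutes_until_power > ups_runtime_min


UPS_RUNTIME_MIN = 10   # stated in the RFO
INSPECTION_MIN = 15    # hypothetical: inspection plus manual generator start

action = generator_action(overcurrent_fault=True)
print(action.name, load_dropped(UPS_RUNTIME_MIN, INSPECTION_MIN))
```

    Under these assumptions, any inspection that takes longer than the UPS runtime drops the load, which matches what happened on the heavily loaded feed.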

    This is apparently a rare event: a direct utility strike this close that does not get dissipated before it hits us. The farther away from your site the strike occurs, the more other load and grounds there are to dissipate it before it gets to you.

    The good news is we did not burn up any equipment.

    Some of you did not lose power because you were connected to the other, lightly loaded incoming feed, which is only 18% loaded at this point and so did not present enough load to overwhelm its breaker.

    Some of you lost network connectivity because the downstream feeder switches that your computers are connected to have only a single power supply.

    We are in the process of examining a facility-wide network upgrade that would move to a newer chassis-based solution throughout the facility. We started looking at this as a way to offer new service capabilities that many of you have been asking for. It is a costly upgrade that will bring redundancy, but it also brings some pitfalls, since you have more connections into a single chassis. We are still looking at this and will keep you up to date on the direction we decide to take.

    They have told me that under normal operating conditions there is really nothing we could have done and we should simply be glad we had good equipment installed that kept our computers from being fried.

    I am thankful that I am not looking at a lot of damaged equipment that could not simply be turned back on - that would be a disaster I do not want to deal with. At this point it seems like the new switchgear with over current protection was a good investment.

  3. #3
    Matt (GlowHost Administrator)
    Join Date: Jan 2005
    Location: Behind your monitor
    Posts: 6,206


    From the techs at the DC:

    Vern is running a fsck on the VZ directory. Ratbite is up, but they cannot get it to reach the outside network; they are checking on it now.

  4. #4
    omarfilip (Nearly a Master Glow Jedi)
    Join Date: Jan 2008
    Location: Dallas, TX
    Posts: 127


    Any news on why Vern is taking so long to come online?

  5. #5
    Matt (GlowHost Administrator)
    Join Date: Jan 2005
    Location: Behind your monitor
    Posts: 6,206


    Fsck takes a long time on a VPS node because VPS servers are usually:

    (very large sites + an OS + cPanel) * "X" servers

    which together take up a lot of disk space that needs to be checked. Basically it's like checking 15 to 20 or so dedicated servers, and the node cannot come back online until fsck has checked them all.

    Most people on VPS are on the upgrade path to dedicated and that is why they have such large sites.

    It is one of the largest drawbacks to VPS hosting IMHO.
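
    To put rough numbers on the formula above, here is a minimal sketch. The per-container sizes are made up for illustration, and disk usage is only a loose proxy for fsck time, which also depends on filesystem size, inode count, and how much damage it finds.

```python
def fsck_workload_gb(containers: int, site_gb: float, os_gb: float, panel_gb: float) -> float:
    """Disk usage to check on one host: (sites + OS + cPanel) * X servers."""
    per_container = site_gb + os_gb + panel_gb
    return containers * per_container


# Hypothetical sizes: a 30 GB site, a 2 GB OS image, and 1 GB for cPanel.
vps_node = fsck_workload_gb(containers=20, site_gb=30, os_gb=2, panel_gb=1)
dedicated = fsck_workload_gb(containers=1, site_gb=30, os_gb=2, panel_gb=1)

print(f"VPS node: {vps_node} GB, single dedicated server: {dedicated} GB")
print(f"Ratio: {vps_node / dedicated:.0f}x")
```

    With 20 containers of similar size, the node has about 20 times as much to check as a single dedicated server, and every container waits on the whole check.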
    Last edited by Matt; 05-20-2008 at 10:58 PM.

  6. #6
    Websync (What's a Guru? I want to be a GlowRu!)
    Join Date: Oct 2005
    Location: California
    Posts: 55


    How long is long? Plus, how long have they been down? The one day I decide to take a few hours off.

    Curious minds want to know if Vern will be back up and running by the start of business tomorrow - PST.

  7. #7
    omarfilip (Nearly a Master Glow Jedi)
    Join Date: Jan 2008
    Location: Dallas, TX
    Posts: 127


    Down since 7:24 PM Central time

  8. #8
    Websync (What's a Guru? I want to be a GlowRu!)
    Join Date: Oct 2005
    Location: California
    Posts: 55


    Thanks for the reply omarfilip.

  9. #9
    Matt (GlowHost Administrator)
    Join Date: Jan 2005
    Location: Behind your monitor
    Posts: 6,206


    As you may have noticed, Vern is finally done with fsck. If any of you have problems, please open a ticket. Alex is standing by waiting for your problems. We would have updated you a little sooner about the server being online, but we were busy putting out other fires.

    I am sure most of you noticed the machines back online. Sorry that it took so long to respond.

  10. #10
    Matt (GlowHost Administrator)
    Join Date: Jan 2005
    Location: Behind your monitor
    Posts: 6,206


    Still working on getting parts for Ratbite. Looks like some of them were hosed in the power fault. We haven't forgotten about you. We will try to get it solved before the phones start ringing for you.
