Welcome to AppFail

Posted on 2009-07-07

Infrastructure Post

Rackspace customers have become frantic about the lack of so-called "Fanatical Support" after suffering more than a week of power and network outages.

Rackspace prides itself on having the best support and infrastructure in its market segment, and charges a fairly high premium for its services based on this reputation. Recent events, however, have shown a lack of redundancy in one specific part of Rackspace's infrastructure: its electrical system. An excitation failure in a generator caused an entire wing of the data center to lose power for a fairly lengthy period, and two maintenance windows were required to effect repairs. Later, a failure in a UPS (one that customers were assured had just been recertified) caused a network outage to a bank and a half of servers.

When I worked at Ontario Power Generation, losing power to the data center was actually a very common occurrence, even though there were eight 500 MW generators on site, powering a huge portion of southern Ontario. Our electrical system was built around a specialized switch that automatically tried to draw power from two separate feeds from the generator house, and then from a central UPS bank. In addition, each server was connected to its own individual UPS to ride out small outages and to handle graceful shutdown of the servers during prolonged ones.
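The layered fallback described above can be sketched in a few lines of Python. This is only an illustration of the priority order, not the actual switch logic used at OPG: the source names, the `server_action` helper, and the five-minute shutdown threshold are all assumptions made up for the example.

```python
from typing import Dict, Optional

# Priority order as described in the post: two feeds from the generator
# house, then the central UPS bank. The names are hypothetical labels.
SOURCE_PRIORITY = ["feed_a", "feed_b", "central_ups"]


def select_source(available: Dict[str, bool]) -> Optional[str]:
    """Return the first live upstream source in priority order, or None."""
    for name in SOURCE_PRIORITY:
        if available.get(name, False):
            return name
    return None


def server_action(available: Dict[str, bool],
                  ups_minutes_left: float,
                  shutdown_threshold: float = 5.0) -> str:
    """Decide what a single server does given upstream state.

    shutdown_threshold is an assumed value: once the per-server UPS has
    less runtime than this, the server shuts down gracefully.
    """
    source = select_source(available)
    if source is not None:
        return "run on " + source
    if ups_minutes_left > shutdown_threshold:
        return "run on server UPS"
    return "graceful shutdown"
```

The point of the per-server UPS shows up in the last two branches: even when every upstream source is gone, the server still gets to choose between riding out a short outage and shutting down cleanly.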

Obviously, what Rackspace needs to consider is moving away from relying solely on centralized UPS banks and adding a further layer of protection with rack-level UPSes. Another route is the one Google chose: moving away from centralized UPS solutions altogether. For the sake of efficiency, this also avoids the AC to DC to AC to DC conversion chain you run through when you deliver mains power (AC) into a UPS battery (DC), then deliver it to a server (AC), where the power supply converts it to the DC used by the components. Google embeds a UPS battery (DC) directly into each server, and with a $2 change to the motherboard, the server can draw power directly from the battery when AC power is lost.
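To see why the number of conversions matters, here is a rough back-of-the-envelope sketch. The per-stage efficiencies are placeholder values chosen purely for illustration, not measurements from Rackspace, Google, or any particular hardware; the only thing the sketch shows is that losses multiply across stages.

```python
from math import prod
from typing import Sequence


def chain_efficiency(stages: Sequence[float]) -> float:
    """Overall efficiency of a series of conversion stages (each in 0..1)."""
    return prod(stages)


# Assumed, illustrative per-stage efficiencies (not measured values).
centralized_ups = [
    0.95,  # AC -> DC: mains into the UPS battery
    0.95,  # DC -> AC: UPS output delivered to the server
    0.90,  # AC -> DC: server power supply feeding the components
]
on_board_battery = [
    0.90,  # AC -> DC: server power supply only; battery sits on the DC side
]

if __name__ == "__main__":
    print("centralized UPS chain:", chain_efficiency(centralized_ups))
    print("on-board battery chain:", chain_efficiency(on_board_battery))
```

Whatever the real per-stage figures are, every stage below 100% efficiency lowers the product, so removing two of the three conversions can only help.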



Cuiusvis hominis est errare; nullius nisi insipientis in errore perseverare - Any man can make a mistake; only a fool keeps making the same one.


