Crucial Paradigm

  #91 (permalink)  
Old 28-12-2009, 01:41 PM
Junior Member
 
Join Date: Oct 2009
Posts: 5

From my understanding - Ijan, Divya et al, feel free to correct me if I'm wrong - this is what happened, in rough order:
  1. Single disk failed (one of two) in the RAID 1 array containing the OS. Emergency maintenance was scheduled, and notice was sent to all affected
  2. Hot-swap of a new drive was made, but for safety reasons (and because it's faster), VPSs were scheduled for downtime during the array rebuild
  3. During the array rebuild, the second disk in the OS array failed, destroying all data on the array
  4. Failed machine was brought up on an emergency OS, not running VPSs, just to retrieve data
  5. Second machine was set up, with hardware matching s332.au and a VPS-ready OS installed. This would be the preferred option, especially since rebuilding on failed hardware is a big no-no. Since you can't just pick up a RAID array from one machine and plug it into another, at this point they would have started copying the customer data from the failed machine to the new machine over the network.
  6. Each VPS was copied one at a time to the new machine (~30 VPSs at ~25+GB each - let's say 1TB in total for guesstimation's sake)
    Copying each VPS over the network takes time (best case would be a 1Gbps network), and having to reconfigure each VPS's Xend files to operate would also take a bit of trickery - this data really should have been backed up with the customer data. Also, at this step, as the recovery progressed past each VPS, the copying process would have become slower and slower, as more and more restored VPSs were fighting for disk and network I/O
  7. Christmas meant CP staff had to work around the clock on this, plus handle a massive increase in support tickets, calls, forum posts etc, which would in turn take eyes off the recovery process
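
To put rough numbers on step 6, here's a back-of-envelope sketch in Python. The 1TB and 1Gbps figures are my guesses from above, and the `efficiency` factor is a made-up knob for overhead and I/O contention, not anything measured on s332.

```python
# Back-of-envelope transfer time. All inputs are rough guesses, not
# measurements from the actual recovery.
def transfer_hours(total_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move total_gb gigabytes over a link_gbps link.

    efficiency is a hypothetical fudge factor (0 < efficiency <= 1) for
    protocol overhead and disk/network contention; 1.0 means the full
    line rate is sustained for the whole copy.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    gigabits = total_gb * 8                    # bytes -> bits
    seconds = gigabits / (link_gbps * efficiency)
    return seconds / 3600


best_case = transfer_hours(1000, 1.0)                  # 1TB at a full 1Gbps
contended = transfer_hours(1000, 1.0, efficiency=0.25) # assumed 25% effective rate

print(f"best case: {best_case:.1f} h")                 # best case: 2.2 h
print(f"at 25% effective throughput: {contended:.1f} h")  # 8.9 h
```

So even the ideal case is a couple of hours of pure copying, and that's before any per-VPS reconfiguration or the slowdown from restored VPSs competing for I/O.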

I don't think a RAID 1 for the OS is suitable, and the Xen configuration files should be remotely backed up (very easy to do), but moving the VPS data to new hardware was definitely the downfall of this whole shemozzle.
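
As an illustration of how little it takes to back up the config files, here's a minimal sketch. I'm assuming the configs live in a single directory (`/etc/xen` is the usual spot, but I don't know CP's layout) and that the destination is some remotely mounted path; `backup_configs` and both paths are hypothetical.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_configs(config_dir: str, dest_dir: str) -> Path:
    """Write a timestamped .tar.gz of config_dir into dest_dir.

    dest_dir is assumed to sit on separate storage (e.g. a remote mount),
    so the archive survives the loss of the host's OS array.
    """
    src = Path(config_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"config directory not found: {src}")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # UTC timestamp in the name keeps successive backups from overwriting
    # each other (at one-second granularity).
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest / f"xen-config-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the
        # absolute path on the host.
        tar.add(src, arcname=src.name)
    return archive


# Example call (both paths are assumptions, adjust for the real host):
# backup_configs("/etc/xen", "/mnt/remote-backup")
```

Run from cron, something like this would mean the per-VPS config never has to be rebuilt by hand after losing the OS array.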

I definitely agree with the complaints about not being notified further as well. Some of us knew to look at the forums, but a lot wouldn't have, and it's such a simple thing to send mass emails these days that it would have saved everyone a lot of confusion. I do really look forward to the full post mortem on this.

In all, I can wholly understand the process taking the amount of time it did. Not happy about it, but I can understand it. This was one of those absolute worst case scenarios that all sysadmins dread, and I really do not envy the CP team in the least.

Last edited by silvervest; 28-12-2009 at 01:44 PM. Reason: more info
  #92 (permalink)  
Old 28-12-2009, 05:10 PM
Junior Member
 
Join Date: Dec 2009
Posts: 5

The one thing that really, really annoyed me with all this was that we received an invoice for the VPS that we couldn't use. I realise it's probably an automated system, but it was just really frustrating to get reminders to pay for something that we weren't actually able to use. And we still can't use it, because another open ticket hasn't been actioned correctly yet. (We mistakenly forgot to ask for a 32-bit VPS during the order process, and I have been trying to get it changed for the last couple of weeks.)

I applaud the tech guys' dedication to getting everything running again, especially in the lead-up to Christmas, as well as the generally good communication on these forums. Hopefully CP will learn from this and put better RAID and backup recovery options in place for the host OS.
  #93 (permalink)  
Old 28-12-2009, 06:08 PM
Member
 
Join Date: Mar 2009
Posts: 34

Quote:
Initially we were replacing a failed drive in the OS array in the server, however during the RAID rebuild the other drive started showing I/O issues and the rebuild failed. There are two RAID arrays in this server: a RAID 1 array with 2 drives for the OS and a RAID 10 array with 8 drives for the virtual servers.

We did first attempt to bring the server back online with 1 drive running the OS so we could then migrate all the VMs to another server, however due to the I/O issues the drive was sitting in read-only mode so we couldn't restart the virtual servers. The fastest solution at the time was to move the drives to another server that we had on standby, so the decision was made to do this around 2am.
By Ijan - Posted on Whirlpool
All times are GMT +11.
