Server 17 Issues

  • Wednesday, 2nd March, 2011
  • 20:10

11.55pm An Update

The server is still online and still rebuilding the RAID array. I lowered the RAID rebuild priority so that normal disk I/O can complete and the sites and the server can be used normally. The backup we are taking is an rsync of the entire main disk of this server, for the reason explained below. The backup is continuing.

The actual problem looks to be with the firmware on the RAID card. However, we tried a newer RAID card and it would not see the array. This is most likely due to a bug in the older firmware that prevents it from seeing the array unless the firmware is updated on the fly on a live file system. This operation is risky and could make the entire array unreadable. As a precaution we are taking another complete backup of the data before attempting this.
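For anyone curious what an rsync backup of a whole disk looks like, here is a rough sketch. It is not the exact command we ran: the function name backup_root, the DRY_RUN switch and the destination backup-host:/backups/server17/ are all made up for illustration. By default it only prints the command it would run.

```shell
# Sketch only: backup_root, DRY_RUN and the destination are hypothetical.
backup_root() {
    src="$1"     # source directory, e.g. / for the main disk
    dest="$2"    # destination, e.g. backup-host:/backups/server17/
    # -a             archive mode (permissions, ownership, times, symlinks)
    # -H             preserve hard links
    # -x             stay on one file system (skips /proc, /sys, other mounts)
    # --numeric-ids  keep uid/gid numbers rather than mapping them by name
    set -- rsync -aHx --numeric-ids "$src" "$dest"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"    # print the command instead of running it
    else
        "$@"         # actually run rsync
    fi
}

# Prints the command; set DRY_RUN=0 to run it for real.
backup_root / backup-host:/backups/server17/
```

Keeping the copy on one file system with -x avoids pulling in virtual file systems and other mounts, which is usually what you want when copying a server's main disk.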

Although the server is online, it will remain under heavy I/O load until the rebuild and backup are complete. No clients have reported any sluggish performance, so it looks as if things are working well at this time.

==============================================

4.00pm Latest Update

The server is back online thanks to the hard work of the staff from the company we rent rackspace from. The RAID is currently rebuilding and all sites are back online. We apologise for any sluggish performance. At the moment two things are happening: the RAID array is rebuilding with normal priority, and we also have a direct link via a network card to another server where we are making a fresh full backup of all the data on the server.

We will post a full explanation in due course and we thank you for your patience. If your site hosted on server 17 is not back online yet, please open a support ticket, as all sites should now be back. A number of clients had us create new accounts for them on different servers and were restoring from local backups. If you now prefer to go back to this server, just let us know by opening a ticket.

I thank those clients who read these announcements and did not open repeated tickets. To those clients who have a number of tickets open (some clients have many tickets open): we will answer all tickets, but at this time we are prioritising other tickets, as we have already pointed you to this announcement thread. Thank you for understanding. Our silence at this time simply means we are prioritising tickets from the 20+ other servers that are not affected.

We really appreciate your business. Please message us on Twitter or Skype if you wish, or open a ticket. All tickets will be answered by midnight tonight.

================================================================

3.00pm Latest Update:

IF ANYONE HAS THEIR DATA STORED LOCALLY AND WANTS TO RESTORE IT THEMSELVES AND CHANGE DOMAIN NAME SERVERS, OPEN A TICKET AND WE CAN MOVE YOU IMMEDIATELY TO ANOTHER SERVER:  http://bwf.co/ticket

We will post a full explanation later, but here is what we know so far. One server is affected, along with 160 client accounts. Sorry if you are on this server. This event is a sheer stroke of bad luck. Our servers are built on RAID10 for resilience in the event of a hard drive failure. It would appear that the server's RAID card failed and took out 3 drives in the process. Why this happened I do not know, but it has happened. It is not something we planned, nor is it something we would ever have expected.

The RAID card was replaced, but the 3 degraded drives were in such a bad state that the server would not rebuild the array. This has caused the extended delay.

At this time techs from the company we rent rackspace from are working at the server and have been all day. They are working to bring a new server online and get all the data back as soon as possible. We do not yet have an estimate, but we are ten times further along the road than we were this morning.

I would ask that, if possible, you do not open tickets. We promise that as soon as we know when the server will be back we will post it here. Normally with RAID a failed drive will not result in any downtime, but on this occasion 3 drives were taken out by the failed RAID card. This is why your accounts are not online at this time.

=====================================================

6.30am: All clients have been emailed to the email address on our billing system.

5.45am: 3 RAID disks look degraded - attempting to get the hardware to rebuild the array

4.00am: NOC staff preparing an identical server to replace this server

1.00am: Server is still not recognizing any disks on the array

10.16pm: RAID card has now been replaced; we will update you shortly

8.39pm: We are still awaiting information from the NOC techs who are looking into this server.

===================================================

Wednesday 2 March 8.00pm

===================================================

Approximately 30 minutes ago our notification system informed us of a problem with Server 17 in the UK NOC. The server appeared to have I/O issues, as can be seen from the results of the simple commands below:

root@server17 [/]# w
-bash: /usr/bin/w: Input/output error
root@server17 [/]# uptime
-bash: /usr/bin/uptime: Input/output error
root@server17 [/]# date
Wed Mar  2 19:39:53 GMT 2011
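
The manual check above could be scripted along these lines. This is an illustrative sketch only and not our actual monitoring; check_bin is a made-up helper name. It runs each command, throws away the output, and reports whether the command could be executed at all.

```shell
# Illustrative sketch only; check_bin is a hypothetical helper name.
check_bin() {
    # Run the command and discard its output; we only care whether it ran.
    if "$@" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        # In the else branch, $? holds the exit status of the failed command.
        echo "FAILED: $1 (exit $?)"
    fi
}

# Try the same three commands as in the transcript above.
for cmd in w uptime date; do
    check_bin "$cmd"
done
```

On a healthy server all three lines should read "ok". A binary that cannot be read from disk would show up as FAILED with a non-zero exit status.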

A reboot has resulted in the RAID card not being able to find any drives on the RAID array. We are having the NOC techs look into this issue now as a matter of priority and we will update you in due course. Any updates will come via this announcement page, so please refresh shortly.

Please do not open multiple tickets or respond to tickets in quick succession.  We are aware of the issue and we are dealing with it as best we can.  Updates will come via this page.  Thanks for understanding.
