Server Issue 27 December 2009: 66.7.205.148

  • Sunday, 27th December, 2009
  • 11:10am

After Event Information

Please note this information is about the server with IP:  66.7.205.148 - no other server was affected.

I wanted to take the time to personally apologise to the customers on this server who had issues.  This was a unique situation: it has never happened to us before, and we hope it will never happen again.

As you know, the server hard drive failed 13 days ago and the server had to undergo an emergency restore.  All issues were resolved and all sites were put back to how they were before the crash.  The last remaining issues were reported to us on 22 December, so we had planned to move the server back to the floor of the data centre and reinstate backups on 26 December.  Two hours before we were due to do this, the new hard drive began showing errors.  We noticed this when some customers complained about Internal Server Errors.

A scan of the server hard drive revealed DMA errors, and our attempts to get the server back up were unsuccessful, although the server did boot in rescue mode.  We contacted the Network Operations Centre, who installed a new drive in the server, and we began copying the data from the bad drive to the new drive.  We had anticipated this would take no more than 12 hours, but we soon realised that the bad drive was in such poor condition that the file transfer was running very slowly.

It was at this point that we took the decision to install another hard drive in the server and restore from the old backups.  This brought all sites online with older data, but at least the sites were resolving.

We continued to copy the files across from the bad drive, and this completed this afternoon.  At this point we synced the files from the recovered drive onto the server, and the sites were put back to how they were on 26 December 2009.  We then synced any email that had been received from 27 December onwards.

I sincerely apologise for this downtime.  This was a unique event - I have never seen two hard drives in one server fail within 12 days of each other.  I thank you for your extreme patience.  This server has otherwise had an excellent uptime record, and we hope this was only a temporary blip on that record.

We are now starting plans to migrate all servers to RAID 5.  All our new servers have RAID 5 as standard, and all existing servers will be migrated within 6 months.  RAID 5 gives much greater resilience in the event of a hard drive failure, usually with less than 60 minutes of downtime when a server hard drive fails.  The drawback is that changing to RAID 5 will mean a few hours of downtime, and there may be associated incompatibility issues that need to be ironed out.  All sites will be made to work.  For this server, we need to decide whether you, the customers hosted on it, want to wait 3 months (so there is no more downtime soon) or would prefer that we do this as soon as possible.

Any opinions from our loyal clients on this server will be taken into consideration, so please do open a support ticket.

If you have any specific questions about anything written above please do not hesitate to contact me personally by emailing:  [email protected] (my personal email address).

Regards

Stephen K
BWF Hosting

=====================================

Update:  All issues have been resolved.  A new hard drive has been installed and all data has been successfully synced.  The server is working normally at this time.

=====================================

 

Server IP:  66.7.205.148

Dear Customer

Here is the situation we are faced with today. As you know, the server you are on had a major hardware failure 12 days ago: the hard drive was totally corrupted and had to be restored.

Last night the server hard drive failed again. We run many servers, and this is something I have never seen happen before. I completely understand your frustration; please realise we are as frustrated as you are. You are paying for a service that we are currently not providing. Unfortunately the 12-day-old replacement drive has now failed too, which is why the server is down.

Nine hours ago the data centre staff tried to get the server back up, but the drive was showing DMA errors. They could only bring the server up in a rescue environment, and began copying the data from this drive to a brand new drive. This data copy is still progressing slowly.

We have in excess of 20 servers of our own, and we manage a number of servers belonging to customers. A situation where 2 drives fail within 12 days of each other is something I have never seen before.

Moving forward, we will be taking steps in the new year to move all servers to a RAID environment, which provides much more resilience in the event of a server hard drive failure. We will be in touch in the new year with more information about that.

Please be assured we are working hard to get the sites back up as quickly as we can. I do not yet have a time frame for the sites coming back online.

I sincerely apologise for any inconvenience caused.

Regards

Stephen K
BWF Hosting
