Server 19 Outage - Monday 20 June 2011 12.27pm

  • Monday, 20th June, 2011
  • 12:27pm

This is a working document to explain what is happening as it happens.  We will post a full and professionally written incident report once we have all accounts back on line.  Please refresh this page for the latest information and excuse any typos in this document.

Please visit:  http://bwf.co/pir for a post incident report

Update 8.27pm

The server is back on line but we are getting reports that the data is from an old backup.  We do know the cloud provider had started 2 data restores: the platform backup that they managed to retrieve, as well as the R1Soft backup.  I assume the platform backup is the one that has been restored.  We are now looking at syncing the data between this server and the backup server.  Please bear with us.

Update 5.39pm

Our cloud provider just called.  75GB of data has now been copied across from the backup.  Not long to go.

Update 4.50pm

Our cloud based provider just called to tell me server 19 is progressing well and will be on line soon.

Update 3.18pm

Our cloud based provider have just called me and this is the latest update as we know it.  I was told that our machine was the first to be destroyed this morning and that its destruction also took the backups with it.  We are therefore in a situation where the provider has rebuilt server 19 and is currently syncing all home directory data from the R1Soft backup server to the newly built server 19.  We have no access to see progress but we have been told it is about a 4 hour process.  This is where we stand right now.  The provider will call me in an hour to give me an update and I will post more information then.

Update 2.52pm

Please remember server 19 had the R1Soft backup solution running on it, so all data is safely hosted on the backup server in a different data centre.  We are waiting for an update before looking at options with these backups.

Update 2.51pm

We are now locked out of OnApp.  We are not sure why and we have a ticket in with the provider.  This means we cannot use the console to gain access to the two servers that are not on line.  We are still waiting for an update regarding server 19.  We are aware we are not the only people affected, as two other clients of the cloud provider have been in contact on Twitter to tell us they were affected as well and that their servers 'disappeared'.

Update 2.10pm

A client has informed us two of his cloud servers are off line.  We are looking into this now as a matter of urgency as we were only aware of server 19 being off line.

Update 2.00pm

The staff at our cloud provider are restoring data at the moment.

Update 1.35pm

OnApp was falsely showing that the servers were off line.  They are actually all on line apart from server 19.

Update 1.25pm

The OnApp.com virtual data centre is still showing many servers as being off line.  This may be reported incorrectly, as all servers appear to be up apart from server 19.

=========================================================================

To our clients hosted on the Cloud

Please note 95% of clients are not affected by these issues today.  All clients on our traditional dedicated servers hosted in DINEnoc in Orlando and Bluesquare in Maidenhead are not affected.  It is only Cloud based clients that have 100% outage at this time.  (**CORRECTION:  The OnApp GUI was reporting all servers as being off line.  Only server 19 was in fact off line.**)

I wanted to write this and be totally up front and honest with you regarding the problems we have had of late with the cloud based hosting service.  All the recent issues we have had have been with our Cloud based hosting and we have had zero issues with our traditional dedicated servers.

Approximately 6 weeks ago we entered into discussions with OnApp (onapp.com) to explore the possibility of us providing cloud based hosting.  We had already bought a Virtual Data Centre from a cloud based provider to give the OnApp.com system a road test.  We wanted to explore the system fully before we actually moved forward with our own cloud based server cluster.  The third party provider is a reputable company.  Taking this Virtual Data Centre is not unlike us renting a dedicated server and the service has been good when it works.

There have been a number of issues with this cloud virtual data centre.  Whilst I am not trying to pass the buck on any of these issues, as we do take responsibility for our buying decisions, perhaps with the benefit of hindsight it was premature to put live clients onto this London based cloud.

The following issues have happened:

  • Approximately 5 weeks ago we ran into problems resizing the disk of server 19 on the cloud.  The disk would not resize and it took a number of support tickets to our supplier, who in turn contacted OnApp for a resolution.  This was resolved after 3 attempts and the disk was resized.
  • There was a DDoS attack on the cloud 2 weeks ago resulting in some downtime, and the provider did work hard to mitigate that attack quickly and efficiently.
  • This morning all our servers on our Virtual Data Centre were flagged to be destroyed.  We noticed this quickly, called them and opened a live chat, and they managed to stop this happening.  Unfortunately server 19 has disappeared and all our servers with them are off line.  We are working with them to get these servers back.

Moving Forward

Two weeks ago we arranged with the data centre we have used for 8 years, and with whom we have a proven record, to deploy our own cloud solution, and I am pleased to say this is going live on 1 July 2011.  This same data centre is where we have some UK VPS servers that have 150+ days uptime with zero downtime.

 
