Post Incident Report: Server 19 Outage (dediserve.com)

  • Monday, 20th June, 2011
  • 20:54

Good Morning

I wanted to spend some time writing a post-incident report to explain yesterday's situation with server 19. We will always be totally up front and honest with our clients, and this report goes into detail about what happened and what we did about it. We will then outline what we are doing moving forward.

Approximately 8 weeks ago we signed up with a cloud-based server provider who operate a cloud in Maidenhead in Berkshire. This was the first time we had used this particular company, and it was a big step given that we had used another company for many years and had built up a really good relationship with them. The company providing the cloud are a reputable company who are pretty well known in the Irish hosting market and who have ambitions to become a large cloud provider worldwide. Although they are a relatively new company, the people behind it had many years' experience in the industry before it was formed. We did trust them and we still do trust them. Although the incident yesterday was serious, they handled it professionally.

At lunchtime yesterday we had a report from a client on Skype that their website was down. This website was on server 19. We immediately tried to ping the server but there was no response. We logged into the onapp.com control panel and saw that all servers in our cloud were in the process of being destroyed. We immediately opened a live chat with the provider and also called them to tell them they needed to stop this at once. It took a few minutes from us noticing the problem to the 'destroy' commands being stopped. We lost 2 servers from our cloud data centre.

After investigation it was discovered that we had not been compromised in any way. The issue was caused by our cloud provider giving their billing provider access to their systems while the OnApp API was being integrated into their website. We were told a sloppy programmer made a typo in an API call, and instead of issuing the command to destroy one server the command was issued to destroy all servers. We were told by the cloud provider that 6 servers (2 belonging to us) were destroyed by this command before it was stopped. The cloud provider have taken steps to ensure this cannot happen again. We pointed out to the provider in an email that, in our opinion, it was sloppy practice to allow a developer integrating API calls to have access to a live platform with live servers, let alone to give that developer privileges to destroy all servers. This has been taken up with OnApp as an issue.
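To illustrate the kind of safeguard we have in mind, here is a minimal sketch of a check that could sit in front of any 'destroy' call. It is not the provider's code and it does not use the real OnApp API: the function name plan_destroy, the DestroyRefused exception, the max_batch limit and the server IDs are all hypothetical, chosen only for this example. The idea is simply that a destroy request with no explicit target, an unknown target, or too many targets is refused rather than being treated as "all servers".

```python
class DestroyRefused(Exception):
    """Raised when a destroy request is too broad or ambiguous to run."""


def plan_destroy(requested_ids, inventory, max_batch=1):
    """Validate a destroy request and return the server IDs it may act on.

    requested_ids: list of server IDs the caller wants destroyed (hypothetical IDs)
    inventory:     dict of known server IDs -> description
    max_batch:     most servers a single request may destroy (assumed policy)
    """
    # An empty request must never fall back to "every server".
    if not requested_ids:
        raise DestroyRefused("no server ID given; refusing to default to all servers")

    # Remove duplicates while keeping the order the caller supplied.
    targets = list(dict.fromkeys(requested_ids))

    # Every target must be a server we actually know about.
    unknown = [sid for sid in targets if sid not in inventory]
    if unknown:
        raise DestroyRefused(f"unknown server IDs: {unknown}")

    # Cap how many servers one request can remove.
    if len(targets) > max_batch:
        raise DestroyRefused(
            f"{len(targets)} servers requested, limit is {max_batch}"
        )

    return targets


if __name__ == "__main__":
    inventory = {"vm-19": "server 19", "vm-20": "server 20"}
    print(plan_destroy(["vm-19"], inventory))
    try:
        plan_destroy(list(inventory), inventory)
    except DestroyRefused as err:
        print("refused:", err)
```

A check like this does not replace keeping developers away from a live platform, which remains our main point, but it limits how much damage a single mistyped call can do.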

A technician from the provider was assigned to us at 1pm yesterday and he worked until well after midnight to get the two servers back online. The vast majority of sites should now be back online and we only know of two sites that are 'missing'. Our technicians are working now to ensure these remaining accounts get restored.

We did have the R1Soft.com backup solution on server 19, which is sold as 'continuous data protection'. We have data from up to 40 minutes before the server was destroyed. If you notice any files missing, please open a support ticket and we can restore them for you. Alternatively, just log into cPanel, select R1Soft and you can restore the files yourself from a web-based interface. R1Soft was set up to back up the data to a different data centre, so even though data was destroyed on the virtual machine we did have a very recent copy of your data secure on another server.

For those clients who are new to us, I can assure you we take uptime seriously, have been in business for years and have some servers with 6 months or more of continuous uptime. We need to work to make sure server 19 is the stable environment you have paid for and rightly expect. If anyone feels they want to move from server 19 to a 'traditional' dedicated server, just open a support ticket and we will gladly move you to another server straight away. The theory behind the cloud is that there should be higher availability in the event of hardware failure, as the cloud has hardware redundancy built in. On a traditional server, if a power supply fails and it takes 90 minutes to replace, you will have a 90 minute outage. On a cloud, if one server develops a fault, the platform will automatically keep the sites on that server online by hot migrating them to another server in the cloud in a matter of seconds (as data is stored off the server on a SAN). Yesterday the platform itself did not fail; the outage was due to human error at the billing provider used by our cloud provider. We still firmly believe that the cloud-based setup is good in terms of removing a number of points of failure (specifically hardware failure).

Moving forward, we are deploying our own cloud in Maidenhead on 1 July 2011. This brings benefits in that we will have full access to all servers and will not simply be a reseller of another company. We have no plans to move server 19 elsewhere, as that would cause further disruption, but we will review the situation if there are any further issues with server 19 on the current platform in the near future. We will however move any client off server 19 to another non-cloud server on request - just ask.

Thanks for taking the time to read this.  If you have any questions at all please open a support ticket.  I will personally answer any questions or concerns you have.

Regards

Stephen K
BWF Hosting

 
