Shared Server Disaster Recovery: Feb 2013

We had a staff meeting yesterday to discuss our 'Disaster Recovery' procedures for shared servers.

We thought we would publish our notes from the meeting, as some of you may be interested in reading them.

==================================================================

What would happen if a BWF Shared Server totally failed?

If we received notification that one of our shared servers was having issues, in the first instance we would have technicians in the data centre despatched immediately to check the situation.  They would immediately start work to repair the issue and we would keep clients informed.  We would also start posting to our announcements page, and we would tweet and post to Facebook.

If after a few hours it became apparent there was likely to be a more serious issue, we would tell these technicians to keep working, but our disaster recovery plan, summarised below, would swing into action.  (Please note this recovery is based on accounts being restored to a new server with new IPs.  We feel clients would prefer the inconvenience of an IP change to being delayed for a long time.)  Remember, the technicians would still keep working on the failed server, as this is the quickest way to recover; the bullet points below are in addition to any work on the failed server:

  • We would run our script (available to all our techs in our repo) to package up the backups in batches of ten at a time, ready for transfer to a new server.  Because we package them in batches, after a short period of time we would be in a position to start restores (after copying each batch of ten to the new server).  This script ignores huge hosting accounts; these need to be restored manually at the end.


  • We would deploy a new server, install CentOS, then install, harden, configure and license cPanel.


  • We would then start copying the backup packages ten at a time to the new server, and our script would restore them as they arrived.


  • If the technicians could not get the old server working again, or if there was going to be a delay, management would make the decision to switch the DNS to the new server.  We do not take this decision lightly: the moment we do this, clients’ websites will work BUT will have been restored from a daily backup, so there may be a little data loss between the time the backup ran and the time the server failed.  Clients using our name servers will need to do nothing.  Clients using A records will need to change their IP address at their domain registrar.
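As a rough illustration of the packaging step above, a batching script might look something like the sketch below.  This is not our production script: the paths, host name and size cut-off are placeholders, and the function prints the scp command for each batch (a dry run) rather than running it.

```shell
#!/bin/sh
# Sketch of the batch-packaging step.  Paths, host name and the size
# cut-off are placeholders.  The function prints the scp command for
# each batch of ten; swap the echo for a real scp to use it in anger.

BATCH_SIZE=10
SIZE_LIMIT_MB=5120   # accounts bigger than this are left for manual restore

package_batches() {
    backup_dir=$1    # where the daily backup archives live
    new_server=$2    # the replacement server receiving the accounts
    batch=""
    count=0
    for acct in "$backup_dir"/*.tar.gz; do
        [ -e "$acct" ] || continue
        size_mb=$(du -m "$acct" | cut -f1)
        if [ "$size_mb" -gt "$SIZE_LIMIT_MB" ]; then
            # Huge accounts are skipped and restored manually at the end
            echo "SKIP (manual restore later): $acct"
            continue
        fi
        batch="$batch $acct"
        count=$((count + 1))
        if [ "$count" -eq "$BATCH_SIZE" ]; then
            # Ship a full batch so restores can begin while packaging continues
            echo "scp$batch root@$new_server:/home/restore/"
            batch=""
            count=0
        fi
    done
    # Ship any remainder smaller than a full batch
    [ -n "$batch" ] && echo "scp$batch root@$new_server:/home/restore/"
    return 0
}
```

The point of batching is that restores on the new server can start on the first ten accounts while later batches are still being packaged and copied.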


Even with the weaknesses we identified in our current disaster recovery plan, if we had a complete server failure we would have everything back up and running within 24-36 hours at most.


How can we speed this up / improve?

  • By the 10th of February we will have deployed a spare empty server that will always be online, ready to receive accounts from any failed server.  Presently we would wait until a server had failed before deploying a new one.  This will speed up the restore process by at least 4-6 hours (the time it takes to install and configure a new server).


  • Only 50% of our shared servers have remote backups at the moment.  The other 50% have backups taken to an extra hard drive located in the same chassis as the server.  This could introduce a delay: if the backup drive is inside the failed server, we would need a technician to remove it from the failed server and install it in another server before we could access the data.  We are going to look urgently at making sure ALL our shared servers have remote backups.  This will come at a slight cost, which we will absorb internally, and we plan to have it implemented across all shared servers by mid March 2013.  As an alternative, we are also going to look at having the 'spare server' sit with an empty drive bay so that, instead of implementing remote backups across the board, we can quickly pop the backup drive into the spare server and start restores right away.  This may be faster, as it would save the need to package the backups up on the remote server and copy the data over using scp (secure copy).  We will continue discussing this.

  • Moving forward we will reduce the time limit before we start packaging up the backups to 2 hours; currently we wait about 4-5 hours.  The thinking is that we would rather package up backups we do not need, and delete them if the server turns out to be fine.
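For the remote backup idea above, the simplest form is an rsync mirror of the backup directory to another machine.  The sketch below is illustrative only: the function just mirrors one directory tree to another, and in production the destination would be a remote target such as backups@host:/path.

```shell
#!/bin/sh
# Sketch of mirroring the nightly backups off-server with rsync.
# In production the destination would be a remote host over ssh.

sync_backups() {
    src=$1
    dest=$2
    # -a preserves permissions and timestamps, --delete mirrors removals,
    # --partial lets an interrupted transfer resume instead of restarting
    rsync -a --delete --partial "$src"/ "$dest"/
}
```

Running this from cron after the nightly backups complete would mean the data is already off the failed machine when a restore is needed.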


What if the BWF website and Billing System went offline?

We strictly host our main website and billing system on a separate server from clients.

We currently host our main website on a four-year-old server.  Whilst it is rock solid and we are loath to move away from something that works and has never had an outage, moving forward it seems prudent to move it to a totally separate new server, and we plan to complete this within six weeks.

Our current BWF contingency plan is shown below; afterwards we will look at how we can improve it moving forward and set a time frame:


  • We would start packaging up the backup the instant we had a server failure.  Last month, for example, the server hosting our website and billing system needed a drive replaced in the RAID array and the array needed to be rebuilt.  There was no downtime during this as the drive was hot swappable.  Purely as a precaution while the rebuild completed, our techs were already packaging up the daily backup and making the up-to-date database available from our hourly database backup.  It was not needed in the end, and a RAID rebuild is a common thing, but we wanted to have a backup plan.

  • We would deploy a new server, install CentOS, then install, harden, configure and license cPanel.

  • We would make some internal changes, such as updating the WHMCS license IP and the ENOM API IP, to make sure the website was functional.

  • We would update the DNS and bring our website and billing system back online.
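The hourly database backup mentioned in the first bullet could be as simple as a cron entry along these lines.  This is a sketch only: the database name, credentials file and paths are placeholders, not our actual setup.

```shell
# /etc/cron.d/billing-db-backup -- hypothetical hourly dump of the billing
# database.  The %H hour field gives 24 rotating dumps per day; note that
# % must be escaped as \% inside a crontab line.
0 * * * * root mysqldump --defaults-extra-file=/root/.my.cnf whmcs | gzip > /backup/db/billing-$(date +\%H).sql.gz
```

A rotating set of hourly dumps keeps the worst-case data loss for the billing database to under an hour.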


We are very confident we could have our main website and billing system back online within 6 hours.


How can we speed this up / improve?

We are still discussing how best to proceed to give really fast recovery in the event of a failure.  Suggestions from our notes from the staff meeting are below:


  • By the 10th of February we will have deployed a spare empty server that will always be online, ready to receive accounts from any failed server.  Presently we would wait until a server had failed before deploying a new one.  This will speed up the restore process by at least 4-6 hours (the time it takes to install and configure a new server).

  • We are looking at perhaps deploying a VPS and keeping a ‘spare’ copy of our website and client area there in a suspended state.  We would perhaps write a script to rsync the database to this ‘spare’ server a few times per day, or we will look at implementing a custom MySQL replication setup.  This would allow almost immediate recovery in the event of a failure.

  • A staff member has suggested keeping the site in an svn repo for quick deployment.
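The warm-spare idea could be rehearsed with a small sync script like the sketch below.  The host name, database name and paths are placeholders; with DRY_RUN=1 (the default here) it prints the commands instead of running them, which is also a safe way to rehearse the procedure.

```shell
#!/bin/sh
# Sketch of keeping a 'spare' VPS copy in step with the live billing
# server.  Host name, database name and paths are placeholders, not
# real infrastructure.  DRY_RUN=1 prints each command instead of
# executing it.

SPARE=${SPARE:-spare-vps.example.com}
DB=${DB:-whmcs}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# Ship a consistent dump of the live database to the suspended spare copy
run sh -c "mysqldump --single-transaction $DB | ssh root@$SPARE mysql $DB"

# Keep the site files in step as well
run rsync -a --delete /home/bwf/public_html/ "root@$SPARE:/home/bwf/public_html/"
```

Run from cron a few times per day, this keeps the spare close enough to the live server that recovery is largely a matter of unsuspending the copy and switching the DNS.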



Staff Resources to handle a Large Outage


All our staff are 100% committed to working extra shifts if needed.  This is not a requirement: during the last outage, one tech came back into work after dinner without telling us and started working again to help our clients.  Outages are rare, and the voluntary ‘flexible’ approach to working means staff can get extended time off with their families during quiet times in return for working during outages.

We also have access to many techs from our remote support company in India to handle busy times and the management there can deploy technicians for us at short notice.

Giles, our Senior Support Admin, is USA based, so during an outage either Stephen K or Giles W can be on hand throughout the night.  With Giles working from the USA, the time difference allows Stephen to get needed sleep and take over the next morning.  Russell, a local staff member, is taking more of a role with us now and is also available.

Finally, it should be noted that nearly all our servers are provided by our long-term partners Hostdime USA and Hostdime UK.  They are a global hosting company and they have the strength of 300+ staff members in the event of outages.  All our servers are on a management contract with them for hardware replacement, and they are very professional.  Knowing this should give you real peace of mind.
