We held a staff meeting yesterday to discuss our 'Disaster Recovery' procedures for shared servers.
We are publishing our notes from the meeting, as some of you may be interested in reading them.
What would happen if a BWF Shared Server totally failed?
If we were notified that one of our shared servers was having issues, we would in the first instance have technicians in the data centre despatched immediately to check the situation. They would start work on repairing the issue straight away and we would keep clients informed. We would also start posting to our announcements page, and we would Tweet and post to Facebook.
If after a few hours it became apparent that there was likely to be a more serious issue, we would tell these technicians to keep working, but our disaster recovery plan, summarised below, would swing into action. (Please note this recovery is based on accounts being restored to a new server with new IPs. We feel clients would prefer the inconvenience of an IP change to being delayed for a long time.) Remember that the technicians would keep working on the failed server, as this is the quickest way to recover; the bullet points below are in addition to any work on the failed server:
- We would start our script (available to all our techs in our repo) to package up the backups in batches of ten, ready for transfer to a new server. Because we package them in batches, after a short period of time we would be in a position to start restores (after copying them in batches of ten to the new server). This script ignores huge hosting accounts, which need to be restored manually at the end.
- We would deploy a new server, install CentOS, install and harden cPanel, then configure and license cPanel.
- We would then start copying the backup packages ten at a time to the new server, and our script would restore them at the same time.
- If the technicians cannot get the old server working again, or if there is going to be a delay, management will make a decision to switch the DNS to the new server. We do not take this decision lightly: the moment we do this, clients’ websites will work BUT will have been restored from a daily backup, so there may be a little data loss between the time the backup ran and the time the server failed. Clients using our name servers will need to do nothing. Clients using A Records will need to change their IP address at their domain registrar.
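The batching step in the bullets above can be sketched roughly as follows. This is only an illustration and not our actual script: the Account type, the plan_restore function and the 50 GB cut-off for a 'huge' account are assumptions made for the example. It shows how splitting accounts into groups of ten lets the first restores begin while later batches are still being packaged, and how huge accounts are set aside for manual restore at the end.

```python
from typing import List, NamedTuple, Tuple

BATCH_SIZE = 10                       # from the notes: batches of ten
HUGE_ACCOUNT_BYTES = 50 * 1024 ** 3   # assumed cut-off (50 GB); not from the notes


class Account(NamedTuple):
    """Hypothetical record for one hosting account's backup."""
    name: str
    backup_bytes: int


def plan_restore(
    accounts: List[Account],
    batch_size: int = BATCH_SIZE,
    huge_bytes: int = HUGE_ACCOUNT_BYTES,
) -> Tuple[List[List[Account]], List[Account]]:
    """Return (batches, manual).

    batches: normal-sized accounts in groups of at most batch_size,
             to be packaged, copied and restored automatically.
    manual:  huge accounts the script skips, restored by hand at the end.
    """
    normal = [a for a in accounts if a.backup_bytes < huge_bytes]
    manual = [a for a in accounts if a.backup_bytes >= huge_bytes]
    batches = [
        normal[i:i + batch_size] for i in range(0, len(normal), batch_size)
    ]
    return batches, manual


if __name__ == "__main__":
    # 24 ordinary accounts of 1 GB each plus one huge 80 GB account.
    accounts = [Account(f"user{i:02d}", 1024 ** 3) for i in range(24)]
    accounts.append(Account("bigshop", 80 * 1024 ** 3))

    batches, manual = plan_restore(accounts)
    for number, batch in enumerate(batches, start=1):
        print(f"batch {number}: {len(batch)} accounts")
    print("manual restore:", [a.name for a in manual])
```

With the sample data this plans three automatic batches (ten, ten and four accounts) and leaves the one huge account for manual restore.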
Even with our current disaster recovery plan and the weaknesses we identified in it, if we had a complete server failure we would have everything back up and running within 24-36 hours at most.
How can we speed this up / Improve?
- By the 10th February we will be deploying a spare empty server that will always be online, ready to take accounts. At present we would wait until a server had failed before deploying a new one. This will speed up the restore process by at least 4-6 hours (the time it takes to install and configure a new server). This server will sit empty, ready to receive accounts from any failed server.
What if the BWF website and Billing System went offline?
We strictly host our main website and billing system on a separate server from clients.
We currently host our main website on a 4 year old server. Whilst it is rock solid and we are loath to move from something that works and has never had outages, it seems prudent to move it to a totally separate new server, and we plan to complete this within 6 weeks.
Our current BWF Contingency Plan is shown below and then we will look at how we can improve this moving forward and set a time frame:
We are very confident we could have our main website and billing system back online within 6 hours.
How can we speed this up / Improve?
We are still discussing how best to proceed to give really fast recovery in the event of failure. Suggestions from our notes from the staff meeting are below.
Staff Resources to handle a Large Outage
All our staff are 100% committed to working extra shifts if needed. This is not a requirement, yet during the last outage one tech came back into work after dinner without telling us and started working again to help our clients. Outages are rare, and the voluntary ‘flexible’ approach to working means staff can get extended time off with their families during quiet times in return for working during outages.
We also have access to many techs from our remote support company in India to handle busy times and the management there can deploy technicians for us at short notice.
Giles, our Senior Support Admin, is USA based, so during an outage either Stephen K or Giles W can be on hand throughout the night. With Giles working from the USA, the time difference allows Stephen to get needed sleep and take over the next morning. Russell, a local staff member, is taking on more of a role with us now and he is also available.
Finally, it should be noted that nearly all our servers are provided by our long-term partner Hostdime USA and Hostdime UK. They are a global hosting company with the strength of 300+ staff members in the event of outages. All our servers are on a management contract with them for hardware replacement, and they are very professional. Knowing this should give you real peace of mind.