Saturday 27 July 10.08am

All servers are back online.  Effectively we switched it off and on again.  Something on the node was trying to take more resources than were available (something that in theory should not be possible on this type of virtualization).  A reboot has fixed the issue.  Apologies that we did not simply reboot at 9am, but we initially assumed the problem was related to the RAID array, given the unexplained high load on the host node itself.  Because a failed RAID array could result in lost data, our policy is clear that only a senior system admin is allowed to work on such servers.  Our senior staff work office hours and are on call on weekends.  It took 30 minutes to get our on-call technician into the office, which explains the delay.  We will continue to monitor and we thank you for your patience.

Saturday 27 July 9.50am

We have shut down all virtual machines on this host node and are starting them up one at a time.  We're fairly certain the problem is being caused by a single virtual machine, and starting them individually will allow us to isolate the problem VM and resolve it.

Saturday 27 July 9.45am

We're fairly certain the RAID disk array is healthy, so that's a positive.  We believe something is going on with the dom0, which is the initial domain started by the Xen hypervisor on boot.  We're going to take all VMs on this node offline and look to reboot the node.  Thanks for your extended patience.

Saturday 27 July 9.25am

A senior staff member is now on shift and is actively troubleshooting this.  As soon as we know more, we will post an update here.

Saturday 27 July 9am

We are aware that three servers are showing as offline on this host node on our virtualization platform.  Our on-call senior admin has been called and will be in the office shortly.  We'll have an update here soon.  Please be assured this is our top priority.

Saturday, July 27, 2019
