[Resolved] 13 August 7.25pm : UK Issues

  • Saturday, 13th August, 2011
  • 7.45pm

7.03am

All servers are back up and running with the exception of one, whose manual fsck is taking a long time and is currently at 75%.

=====================================================

4.55am

Most of the virtual machines are up, with only 4 still offline. We are working to complete the restoration of service and again thank you for your extreme patience.

=====================================================

2am

We need to reboot all machines and run a manual fsck on them; this is why some websites are showing errors.  The OnApp system is not allowing us to boot the servers in recovery mode, so we are doing it at the hypervisor level.

 

=====================================================

Midnight

The issue has been resolved.  The network SAN developed a fault as described below.  When a drive failed, for some unknown reason one of the iSCSI disks "lost" its LVM metadata.  As a result, all of its partition data was lost, causing the servers to appear down.  We had to recover the partition data.

=====================================================

11.17pm

The issue was caused by a hard drive in the SAN failing.  This should not in and of itself cause an issue, as there is massive redundancy built into the SAN.  Here is what we believe happened, but the techs will need to do a lot more research into this.  When the drive failed, the RAID controller sent an 'abort request', which we imagine caused the logical device to become unresponsive.  This left iSCSI hanging without a device for too long.

The above is a best guess and we need to have the techs look into this in a lot more detail as a drive failure on a SAN should absolutely not cause any downtime at all.

We are rebooting all servers now connected to the SAN and expect everything to come back soon.  We will post updates here as we have them.

=====================================================

10.06pm

Our NOC is working on this now; we just had an update that they are actively working on the servers.  We will post updates as soon as possible.

=====================================================

9.09pm

We are still working on this issue and will have an update as soon as possible.  A technician is on his way to the Maidenhead Data Centre to check into the root cause of this intermittent issue.  We will be in contact with this tech as soon as possible.

=====================================================

7.25pm

We have had a few reports, which we have confirmed, of some websites on some UK servers not resolving or loading correctly.  We are actively investigating this issue now and will have an update for you soon.  Thanks for your patience.  If possible, please do not open a support ticket at this time.  Please take this message as an indication that we are working on the issue; we will post further updates soon.

=====================================================
