Overload last night & Scheduled Maintenance Notice

  • Friday, 4th March, 2011
  • 08:02am

Post Maintenance Report - 7.45am, 8th March, 2011

Please read the following report.  Clients on this server will know that in the past 12 months there have been virtually no issues with this server: uptime has been near 100%.

On four nights at 3am the server load spiked and sites became unreachable for a short period of time.  Investigation determined that one user on this server was running hundreds of cron jobs, all starting at the same time.  This issue has now been rectified.  Unfortunately this is a slight disadvantage of shared hosting: one user on the server can affect others.
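This report does not say how the cron job problem was rectified.  One common approach is to stagger the start times so the jobs no longer fire in the same minute.  The sketch below is illustrative only: the function name, the 3am start hour and the 60 minute window are assumptions, not our actual fix.

```python
def stagger_cron_jobs(commands, start_hour=3, window_minutes=60):
    """Return crontab lines that spread `commands` evenly over a time window.

    Hypothetical helper: instead of every job starting at start_hour:00,
    each job gets its own minute offset inside the window.
    """
    if not commands:
        return []
    total = len(commands)
    lines = []
    for index, command in enumerate(commands):
        # Evenly spaced offset in minutes after start_hour:00.
        offset = (index * window_minutes) // total
        hour = (start_hour + offset // 60) % 24
        minute = offset % 60
        # Standard five-field crontab schedule: minute hour day month weekday.
        lines.append(f"{minute} {hour} * * * {command}")
    return lines


if __name__ == "__main__":
    # Example paths are made up for illustration.
    jobs = [f"/home/user/job{i}.sh" for i in range(4)]
    for line in stagger_cron_jobs(jobs):
        print(line)
```

With 300 jobs and a 60 minute window, this places 5 jobs in each minute rather than 300 in the first minute.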

As we had the server off the rack (initially we wrongly assumed the problem was hardware related), we took the decision to replace the separate drive we use for time-stamped backups (separate from the RAID array), as its SMART stats were showing a few minor errors.  Last evening the backups were run from the server; as the backup drive was new, it needed to do a full backup, and the server slowed down slightly.  This is a one-off, as tonight the server will run incremental backups again.
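To illustrate why a new backup drive forces a slower full backup, here is a minimal sketch of the difference between full and incremental selection.  It assumes a simple modification-time comparison; the function name is hypothetical and this is not the backup software running on the server.

```python
import os


def files_to_back_up(root, last_backup_time=None):
    """Return the files under `root` that a backup run would need to copy.

    With no previous backup (for example a brand-new backup drive),
    last_backup_time is None and every file is selected: a full backup.
    Otherwise only files modified after last_backup_time (seconds since
    the epoch) are selected: an incremental backup.
    """
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if last_backup_time is None or os.path.getmtime(path) > last_backup_time:
                selected.append(path)
    return sorted(selected)
```

An incremental run usually selects far fewer files than a full run, which is why the slowdown should not repeat on later nights.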

This server will be put back on the rack in the next 24-28 hours, in the middle of the night UK time, when the fewest users will be affected.  Downtime will be 5-10 minutes at most.

Here's hoping for another 12 months of no problems with this server.  Sorry to those who were affected.



Good Morning

This notification applies to one server in the USA data centre.  No other server is affected.

At 4am this morning our monitoring system informed my colleague on duty that the server had a very high load.  The server was not down but appeared inaccessible or very slow for some people.  We rebooted the server.  The problem then happened again; this time we managed to log in to the server and discovered a process consuming a lot of CPU power:


top - 04:40:43 up 1:13, 1 user, load average: 352.10, 264.17, 131.29

18512 root 18 0 6224 812 640 R 99.9 0.0 9:12.66 /usr/sbin/repquota -auv


As you can see, the load on the server was climbing fast: the three figures are the 1, 5 and 15 minute load averages, so the 15 minute average was 131 while the 1 minute average had already reached 352.  We took the decision to reboot the server.  The load seemed to calm down, but it would appear the file system may be out of sync, so we need to schedule a reboot during which we can run a file system check (FSCK).  We have scheduled this for 3am UK time on Saturday morning.
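For anyone who wants to read the same figures on their own machine, the header line of top can be parsed as sketched below.  The function names are made up for illustration, and this is not how our monitoring system works; it only shows how the three load averages relate to each other.

```python
import re


def parse_load_averages(top_header):
    """Extract the 1, 5 and 15 minute load averages from a top header line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_header
    )
    if match is None:
        raise ValueError("no load average found in header line")
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen


def load_is_rising(top_header):
    """True when the shorter averages exceed the longer ones (load climbing)."""
    one, five, fifteen = parse_load_averages(top_header)
    return one > five > fifteen


if __name__ == "__main__":
    header = "top - 04:40:43 up 1:13, 1 user, load average: 352.10, 264.17, 131.29"
    print(parse_load_averages(header))
    print(load_is_rising(header))
```

A 1 minute figure well above the 15 minute figure means the load is still growing; the reverse pattern means it is settling.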

Thank you for understanding the need to perform this FSCK.
