The uptime statistic is almost irrelevant in high availability infrastructures such as the one shown here.

Defining a server's true uptime/downtime

Jarrod Spiga, 31 August 2008, 11:06 PM

Jack of Most Trades | Jarrod Spiga looks for an appropriate measure of the stability of a system.


A few weeks ago, a colleague at work bemoaned that a blackout had “ruined” the two-year-plus uptime of the Linux server he runs at home. While I have come across even longer uptimes over the years, these instances raise one question: is uptime an appropriate measure of the stability of a system?

Consider a typical production Linux web and mail server that hosts dozens of busy virtual hosts. A daily cron job processes the web server logs and generates statistics for each virtual host. Because this job stresses the system’s I/O and memory subsystems, it is scheduled to run when load on the server is typically low. However, the load it generates does occasionally force the email server to deny inbound connections.

For all intents and purposes, the mail server is down during this time, although the impact is trivial given the retry mechanisms built into SMTP. I chose this specific example because it’s fairly simple to replicate, without manual intervention, on any Linux system running Sendmail. Regardless, it does constitute downtime and, more importantly, it does not reset the uptime counter.

Of course, you can always manually shut down the web service. This too constitutes downtime (most likely with a greater impact) without resetting the system’s uptime.

On the other side of the equation, consider some of the fault tolerance methods that are often deployed in enterprise infrastructures to mitigate the impact of such issues. On a daily basis, I look after a number of web servers that sit behind a pair of load balancers. I can easily bring down one web server to install security updates or new drivers, update firmware, or even switch the server off entirely, without affecting the uptime of the websites hosted on that server. Yet the uptime counter on that server is reset.

Both of these are situations where a system’s uptime does not correlate with the uptime of the services provided by that system.
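To make the distinction concrete, here is a minimal sketch of measuring a service's availability independently of the host's uptime counter. Everything in it is an illustrative assumption rather than anything from the servers described above: the `Probe` record, the one-probe-per-minute schedule, and the ten-minute window in which the mail service refuses connections are all made up for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Probe:
    """One hypothetical health check against a service."""
    minute: int  # minute of the day at which the check ran
    ok: bool     # True if the service accepted the connection


def service_availability(probes: Sequence[Probe]) -> float:
    """Return the fraction of probes that the service answered."""
    if not probes:
        raise ValueError("no probes recorded")
    answered = sum(1 for p in probes if p.ok)
    return answered / len(probes)


# Assumed scenario: one probe per minute for a day, with the mail service
# refusing connections for ten minutes while the log-processing job runs.
probes = [Probe(minute=m, ok=not (180 <= m < 190)) for m in range(1440)]
availability = service_availability(probes)
print(f"service availability: {availability:.4%}")
```

In this made-up day the service is available for 1430 of 1440 probes, or roughly 99.31% of the time, while the host's uptime counter would keep climbing as if nothing had happened. The reverse holds behind a load balancer: a probe against the website keeps succeeding while an individual server's counter drops to zero.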

Even if you have fault tolerance built into a system, things can still go wrong, as evidenced by the meltdown of the Optus network a few weeks back. I greeted the news that Optus are installing another fibre cable to provide additional redundancy to this section of their network with a muted laugh, since my impression was that the root failure was more of an operational one. (I argue that they should have had a cold-spare DWDM amplifier card ready somewhere from which it could have been quickly shipped to the Stanthorpe site before their other fibre link was dug up by a water boy in a backhoe.)

When looking at non-enterprise systems, I don’t know that a long uptime is such a good thing, even where no additional fault tolerance has been deployed.

The first thing I think of when I see a long uptime is “has this system been updated?” For many people, a Windows system with a long uptime obviously hasn’t been updated, mostly because Windows security updates tend to require reboots. But don’t be lulled into a false sense of security because you’re running Linux: kernel updates require reboots too.

The uptime statistic can be useful in certain situations. For instance, it can be used to identify when a server was last rebooted and to alert you to the fact that it has rebooted, which can be handy when a blue screen of death or kernel panic occurs at some ungodly hour of the morning and the automatic server recovery features of our servers kick in.
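As a rough illustration of that use, the sketch below derives the last boot time from the uptime counter and flags a recent reboot. It assumes a Linux host, where the first field of `/proc/uptime` is the number of seconds since boot; the function names and the 15-minute threshold are hypothetical choices for the example, not part of any monitoring tool.

```python
from datetime import datetime, timedelta, timezone


def parse_uptime_seconds(proc_uptime_text: str) -> float:
    """Extract seconds since boot from the contents of /proc/uptime."""
    # The file holds two numbers: uptime and cumulative idle time.
    return float(proc_uptime_text.split()[0])


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read the uptime counter from a Linux host."""
    with open(path) as handle:
        return parse_uptime_seconds(handle.read())


def boot_time(uptime_seconds: float, now: datetime) -> datetime:
    """Work backwards from the current time to the moment of boot."""
    return now - timedelta(seconds=uptime_seconds)


def rebooted_recently(uptime_seconds: float,
                      threshold_seconds: float = 900.0) -> bool:
    """True if the host booted within the (assumed) alert threshold."""
    return uptime_seconds < threshold_seconds


if __name__ == "__main__":
    uptime = read_uptime_seconds()
    booted = boot_time(uptime, datetime.now(timezone.utc))
    if rebooted_recently(uptime):
        print(f"ALERT: host rebooted at {booted.isoformat()}")
    else:
        print(f"last boot: {booted.isoformat()}")
```

Run from cron every few minutes, a check like this would notice an unattended overnight recovery, which is about the limit of what the counter can tell you.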

Going back to my work colleague’s server, he mentioned that he plans to rebuild his server this weekend – mainly because the distro of Linux that he’s using is no longer maintained and has not been updated in more than a year. I ask, is a two-year uptime really worth bragging about when you leave the server that holds all of your data potentially insecure and vulnerable to attack?



Jarrod Spiga is an infrastructure engineer.


