Jack of Most Trades | Jarrod Spiga looks for an appropriate measure of the stability of a system.
A few weeks ago, a colleague at work bemoaned that a
blackout “ruined” their two-year-plus uptime on the Linux server
that they had at home. While I have come across even longer uptimes over the
years, these instances raise one question: is uptime an appropriate measure of
the stability of a system?
Consider a typical production Linux web and mail server that
hosts dozens of busy virtual hosts. A cron job is run daily that processes the
web server logs and generates statistics for each virtual host. Because this
job stresses the system’s I/O and memory subsystems, the process is scheduled to
run when load on the server is typically low. However, the load caused by this
process does occasionally force the email server to deny inbound connections.
For all intents and purposes, the mail server is down during
this time, although the impact is trivial, considering the retry mechanisms that
are incorporated into the SMTP protocol. This specific example was chosen since
it’s fairly simple to replicate on any Linux system running Sendmail
without manual intervention. Regardless, it does constitute downtime –
and more importantly, does not reset the uptime counter.
Of course, you can always manually shut down the web service.
This too constitutes downtime (most likely with a greater impact) without
resetting a system’s uptime.
On the other side of the equation, consider some of the
fault tolerance methods that are often deployed in enterprise infrastructures
to mitigate the impacts of such issues. Every day, I look after a number of web
servers that sit behind a pair of load balancers. I can easily
bring down one web server, install security updates and new drivers, update
firmware or even switch a server off entirely without affecting the uptime of
the websites hosted on that server. Yet the uptime counter on such a system is
affected.
Both of these are situations where a system’s uptime
does not correlate with the uptime of services provided by that system.
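One way to make the distinction concrete is to measure the service rather than the machine. The following is a minimal sketch, not a monitoring tool: it assumes a simple TCP connection test is a good enough stand-in for "the service is answering", and the names `probe`, `availability` and `probe_log` are hypothetical, chosen for illustration only.

```python
import socket
from typing import Iterable, Tuple


def probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable and timed-out connections all count as "down".
        return False


def availability(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the fraction of probes that succeeded, from 0.0 to 1.0."""
    results = [ok for _, ok in samples]
    if not results:
        raise ValueError("no probe samples")
    return sum(results) / len(results)


# Hypothetical probe log: (seconds since monitoring began, did the service answer?)
probe_log = [
    (0.0, True),
    (60.0, True),
    (120.0, False),  # e.g. the mail server refusing connections under load
    (180.0, True),
]

print(f"service availability: {availability(probe_log):.0%}")
```

A server could show an unbroken uptime counter over this whole period while the service it hosts was only reachable for three probes out of four.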
Even if you have fault tolerance built into a system,
things can still go wrong – as evidenced by the meltdown of the Optus
network a few weeks back. I greeted news that Optus are installing another
fibre cable to provide additional redundancy to this section of their network
with a muted laugh, seeing as my impression was that the root failure was more
of an operational one (I argue that they should have had a cold-spare DWDM
amplifier card ready somewhere where it could have been quickly shipped to the
Stanthorpe site before their other fibre link was dug up by a water boy in a
backhoe).
When looking at non-enterprise systems, I don’t
know if having long uptimes is such a good thing – even if additional
fault tolerance methods have not been deployed.
The first thing I think of when I see a long uptime is
“has this system been updated?” For many people, a Windows system
with a long uptime obviously hasn’t been updated, mostly because Windows
security updates tend to require reboots. But don’t be lulled into a
sense of security because you’re running Linux – kernel updates
require reboots too.
The uptime statistic can be useful in certain situations.
For instance, it can be used to identify when a server was last rebooted and
alert you to the fact that it has rebooted – which can be handy when a blue
screen of death or kernel panic occurs at some ungodly hour of the morning and
the automatic server recovery features of our servers kick in.
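As a rough illustration of that use, the sketch below works out when a Linux machine last booted and whether that was recent enough to be worth an alert. It relies on the first field of `/proc/uptime` being the number of seconds since boot; the function names and the 15-minute alert window are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def parse_uptime_seconds(proc_uptime_text: str) -> float:
    """Extract seconds since boot from the contents of /proc/uptime."""
    # The file holds two numbers; the first is the uptime in seconds.
    return float(proc_uptime_text.split()[0])


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read the current uptime from a Linux system."""
    with open(path) as handle:
        return parse_uptime_seconds(handle.read())


def last_boot(now: datetime, uptime_seconds: float) -> datetime:
    """Return the moment the system booted, given the current time."""
    return now - timedelta(seconds=uptime_seconds)


def rebooted_recently(uptime_seconds: float, window_seconds: float = 900.0) -> bool:
    """Return True if the system booted within the alert window."""
    return uptime_seconds < window_seconds


# Worked example using a fixed clock and a sample /proc/uptime line.
sample_now = datetime(2024, 1, 2, tzinfo=timezone.utc)
sample_uptime = parse_uptime_seconds("86400.00 170000.00\n")
print("last boot:", last_boot(sample_now, sample_uptime).isoformat())
print("alert:", rebooted_recently(sample_uptime))
```

Run from cron every few minutes with `read_uptime_seconds()` in place of the sample line, a check like this would flag an unplanned overnight reboot without anyone needing to watch the console.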
As for my work colleague’s server, he mentioned
that he plans to rebuild his server this weekend – mainly because the
distro of Linux that he’s using is no longer maintained and has not been
updated in more than a year. I ask, is a two-year uptime really worth bragging
about when you leave the server that holds all of your data potentially
insecure and vulnerable to attack?

Jarrod Spiga is an infrastructure engineer.