Saturday, June 5, 2010

Get out of the sysadmin firefighting business

A while back there was a post on the lopsa-discuss mailing list about time management.  The post and the ensuing thread contain a number of really good suggestions about how to handle your work time more effectively so that you are more productive, less harried, and start to gain a real sense of situational awareness about your environment.  It's all good stuff and I have used many of the suggestions in that thread with great success.  If you're a system administrator and feel that you need 36 hours in a day, go read that thread and then do at least one of the recommendations.  You'll never look back.

However, there was one particular bit from the original post that really has been hanging out in the back of my mind, bugging me:
I frequently find myself dealing with so many little things throughout the day that by the end of the day I feel like I've been busy but can't really point at what I've done during the day.
So the entire day is spent running around "fighting fires"?  Time management can't fix that problem.  Trust me, I've tried.  It can help, and it's a great first step, so you should do it.  But at some point you need to stop looking for better firefighting techniques and start fireproofing things so they don't catch fire in the first place.  You might think that's a really hard (or even impossible) thing to do and that asbestos underwear is itchy.  Luckily, you'd be wrong on the first part of that thought, and I'd like to talk about some high-level, introductory concepts that can help you get started fireproofing quickly.

And, no, I don't really want to talk about your underwear.

The first step for me is always to fix the flare-ups, the small recurring fires.  If you're constantly fighting the same fire, over and over again, then it's time you showed up with something more than a garden hose.  You'll be happy, your users will be happy, your bosses will be happy.  And as a wonderful side effect you'll have more time to manage, because you won't be stuck in reactive mode fixing things all of the time!

In a perfect world you'd see the problem at its very core, tackle it with precision, and resolve it once and for all.  Pesky print server?  Replace it!  Unhappy database server?  Upgrade it!

We don't live in a perfect world though, so sometimes the only real fix is to manage the problem so that the pain it causes stays at a bearable level until you can handle it correctly.  One method is to isolate the problem so that when it explodes it can't take anything else out.  For example, take that troublesome application that locks up whatever hardware it's on and forces a reboot, and move it to a dedicated host.  That way the reboot only affects the application instead of everything on the server.  Another method is to install some sprinklers that will automatically put out the fire for you.  Got a service that likes to leak memory?  Automate a restart during the lowest usage period so that the leak doesn't cause problems during peak usage times.
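As a rough illustration of the "sprinkler" idea, here is a minimal Python sketch.  It assumes you already collect hourly usage samples somewhere, and the function names, the sample format, and the memory threshold are all made up for this example.  The extra memory check is my own addition so the restart only fires when the leak has actually grown.

```python
from collections import defaultdict
from statistics import mean


def quietest_hour(samples):
    """Return the hour of day (0-23) with the lowest average load.

    `samples` is an iterable of (hour_of_day, request_count) pairs,
    a hypothetical format standing in for whatever usage data you have.
    """
    by_hour = defaultdict(list)
    for hour, count in samples:
        by_hour[hour].append(count)
    return min(by_hour, key=lambda h: mean(by_hour[h]))


def should_restart(current_hour, rss_mb, quiet_hour, rss_limit_mb):
    """Restart only during the quiet hour, and only if memory use
    has reached the (assumed) threshold."""
    return current_hour == quiet_hour and rss_mb >= rss_limit_mb
```

Something like cron can call a small script built on this once an hour, and the script runs your usual restart command only when `should_restart` says yes.  If your quiet period never moves, a single fixed cron entry is even simpler.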

That's all fine for technical issues.  If you're constantly putting out fires from end-user questions and tickets, there are some other strategies that can help.  Documentation is one method, but self-service documentation portals are only so useful.  Often we forget to update the docs so they're a little bit wrong, users don't follow directions carefully, some just don't want to, and so on.  Beyond documentation, I take a three-pronged approach to handling fires from users:
  • Educate: Try to educate your users when you can so they understand the problem they're having.  If you explain it well enough, they can synthesize the information and use it to help themselves later.  Better yet, if you have desktop support or helpdesk staff, educate them so they can fix the problem on first contact with the user so everyone walks away happy.
  • Automate: Accountants are not sysadmins.  They do not want to follow a 12-step process to reset their passwords.  Automate the things people do frequently that cause problems so they're easy and less error-prone.
  • Facilitate: Some people just are not reasonable.  Facilitate their needs by getting it done without argument or hassle so everyone can get on with their life.  Often just doing whatever it is will take less time than arguing about it anyway, so skip the drama and suck it up.
A similar strategy can work for management-initiated fires too, though with a heavier dose of facilitation.

The takeaway here is that if you're fixing the same thing over and over again, you're not really fixing it.  Step back, look at the problem from all sides, examine the pain points, and find a way to get the fire under control enough that you get some time and sanity back and your users don't feel like they need those pitchforks and torches.  If you can put the fire out once and for all, even better.  If not, you're probably dealing with a big fire, which takes a separate type of attention.

A side effect of fixing the flare-ups is that the air is a lot clearer, so you can see the smoke from the real fires.  So my second step in fireproofing is to start looking for that smoke and, if possible, the flames at the source.  In order to see the fire before your users do, start monitoring the performance, capacity, and availability of your environment.

If you don't have monitoring in place already, put some in and start with monitoring something about everything.  Don't spend huge amounts of time or money on monitoring at this point because you'll have no idea what you really need.  Stand up some cheap and easy monitoring solution, start tossing stuff into it, and see what's useful.  If something breaks, put in a monitor for it.  Eventually you'll have enough monitoring in place (and experience from it) to make an educated and well-formed decision about what you need to do in order to get to a point of comprehensive and useful monitoring.  And be sure to do that evaluation, otherwise...
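To give a sense of how cheap "cheap and easy" can be, here is a minimal Python sketch of an availability check using only the standard library.  The host names, ports, and function names are placeholders I made up.  It only tells you whether something is answering on a TCP port, so treat it as a starting point and not as a replacement for a real monitoring package.

```python
import socket


def check_tcp(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the name didn't resolve.
        return False


def failing_targets(targets, probe=check_tcp):
    """Return the (host, port) pairs that did not answer.

    `probe` is swappable so other kinds of checks can be plugged in.
    """
    return [(host, port) for host, port in targets if not probe(host, port)]
```

Run `failing_targets([("web01", 80), ("db01", 5432)])` every few minutes and mail yourself anything it returns, and you have something about everything.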

If you do have monitoring, fix it.  Seriously.  If you just had to fix a series of flare-ups, and you suffer interruptions every minute of the day because something is broken or needs attention and you didn't get it resolved before the users took notice, then something is fundamentally broken with your monitoring.  Evaluate what you monitor, how you monitor it, and what you monitor it with.  Look to see where the breakdown is.  Do too many fine-grained monitors make even a server reboot look like a calamity?  Add in some dependencies.  Monitoring package doesn't monitor services well?  Add something else that does.  Is it really hard to set up proper monitoring because each machine needs a finicky client installed?  Find something new.
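For the dependency point, the idea is that a check should stay quiet when something it depends on has already failed.  Most monitoring packages have their own way of declaring this, so the Python below is only a sketch of the logic, with check names and a `depends_on` mapping invented for the example.

```python
def root_causes(failed, depends_on):
    """Return the failed checks that are not explained by a failed parent.

    `failed` is a set of check names currently failing.
    `depends_on` maps a check name to the single check it depends on
    (a simplifying assumption; real tools allow several parents).
    """
    def has_failed_ancestor(check):
        seen = set()
        parent = depends_on.get(check)
        # Walk up the chain, guarding against accidental loops.
        while parent is not None and parent not in seen:
            if parent in failed:
                return True
            seen.add(parent)
            parent = depends_on.get(parent)
        return False

    return sorted(c for c in failed if not has_failed_ancestor(c))
```

With `web01:http` and `web01:disk` both depending on `web01:ping`, a reboot of web01 produces one alert for the ping check instead of three.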

People laugh and think I'm joking when I say "Monitoring is a journey, not a destination" but I'm not.  Things change and your monitoring will need to change along with those things.  As a system administrator, it is the single most useful thing you can have in your arsenal.

So that's my simple two-step, two-minute introduction to getting out of the sysadmin firefighting business.  I don't maintain that these suggestions will put out every fire you may have or come across.  I do think they offer a good place to start.  In future posts I'd like to examine how to deal with the larger fires that arise, tire fires, better fireproofing through design, and what kind of tools are out there to help you fight the fires.