Archive for the 'Information Technology' Category

A Scientific Method for Troubleshooting

I think there’s a new heavyweight champion in the world of Mike’s IT-related pet peeves. It’s called “just trying a bunch of random things until my problem goes away.” This is in no way related to troubleshooting, which I will define as “uncovering the root cause of an issue and then resolving it deliberately.” While there are many different techniques for troubleshooting specific problems, I’m going to attempt to show how the scientific method can provide a common framework for more effective troubleshooting.

Step 1: Describe the problem

This is the easy part. Find out what the problem is and reproduce it. That second part is important, because the typical user can’t always be trusted to know what they’re doing. By reproducing the problem, you can rule out user error and verify that there actually is a problem.

Step 2: Gather and analyze data

This is the part everyone likes to skip, and it is the real subject of this rant. Step two requires direct observations in order to find out exactly what is happening. How you do this is very much dependent on the problem at hand, but it involves things like log and packet analysis (which may very well require a third-party tool not included within the base OS). In any case, please don’t just take a wild guess about what’s causing the problem and then proceed down the path of random experimentation. I don’t know for sure how people acquire this bad habit, but I have a strong suspicion that it comes from working in Microsoft Land, where the computers have personalities, rebooting is the inexplicable fix for everything, and a sea of GUIs makes it easy for the novice sysadmin to miss what’s really going on.

Part of the problem in Microsoft Land is that proper troubleshooting tends to be a lot more difficult than it needs to be. For example, there is a severe lack of useful diagnostic tools included within the Windows OS itself. Why are the Windows support tools, resource kit tools, and IIS diagnostic tools still separate downloads? The same question can be asked about the Sysinternals Suite (which Microsoft has owned for several years now). Why are they still shipping obsolete utilities instead of their newer replacements (e.g. nslookup, which was obsoleted by dig many, many years ago)? And lastly, why does Microsoft constantly try to hide any information that could be useful for troubleshooting? Anyone who has ever had to view the message headers on an email in Outlook knows exactly what I’m talking about here, but I digress.
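To make "direct observation" a little more concrete, here is a minimal Python sketch of the kind of log analysis step two calls for. Everything specific in it is an assumption made for illustration: the "date time LEVEL message" line layout, the level names, the `summarize` function, and the sample lines are all hypothetical, so adjust the pattern to whatever your logs actually look like.

```python
import re
from collections import Counter

# Assumed line layout: "<date> <time> <LEVEL> <message>". Real logs will differ.
LINE_RE = re.compile(r"^(\S+ \S+) (DEBUG|INFO|WARN|ERROR|FATAL) (.*)$")


def summarize(lines):
    """Count lines per level and tally the distinct ERROR/FATAL messages."""
    levels = Counter()
    errors = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed layout
        _, level, message = match.groups()
        levels[level] += 1
        if level in ("ERROR", "FATAL"):
            errors[message] += 1
    return levels, errors


# Hypothetical sample data standing in for a real log file.
sample = [
    "2024-01-01 10:00:00 INFO service started",
    "2024-01-01 10:00:05 ERROR connection refused to db01:5432",
    "2024-01-01 10:00:09 ERROR connection refused to db01:5432",
    "2024-01-01 10:00:10 WARN retrying in 5s",
    "a line that does not match the layout",
]

levels, errors = summarize(sample)
print(levels.most_common())
print(errors.most_common(3))
```

The point is not the script itself but the habit: count what is actually in the log before deciding what the problem is. A message that repeats hundreds of times is evidence; a hunch is not.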

Step 3: Form a hypothesis

It isn’t until you figure out what’s happening that you can address the question of why it’s happening. Step three is where you use the information you gathered in step two to determine a logical course of action. Remember that a hypothesis beginning with “maybe” or “I think” with little or no direct evidence to back it up is often a dead giveaway for someone who doesn’t know what the hell they’re talking about.

Step 4: Test your hypothesis

Perform your planned course of action.

Step 5: Analyze results and draw conclusions

Check to see if the issue is resolved. If not, revert your changes and go back to step three. When drawing conclusions, ask yourself how this problem occurred in the first place. Was your most recent fix permanent or just a temporary band-aid? If the fix was temporary, make sure you schedule a time to implement a permanent fix.
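Steps three through five form a loop, and it can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `troubleshoot` function, the `config` dictionary, the addresses, and both hypotheses are hypothetical stand-ins for whatever change and reproduction test apply to your real problem.

```python
def troubleshoot(hypotheses, is_resolved):
    """Test evidence-backed hypotheses one at a time, reverting any that fail.

    hypotheses: iterable of (name, apply_fix, revert_fix) tuples.
    is_resolved: callable that returns True once the problem no longer reproduces.
    Returns the name of the hypothesis that fixed the problem, or None.
    """
    for name, apply_fix, revert_fix in hypotheses:
        apply_fix()              # step 4: perform the planned course of action
        if is_resolved():        # step 5: check whether the issue is resolved
            return name
        revert_fix()             # not resolved: revert, then back to step 3
    return None


# Hypothetical system state; the "problem" is a resolver address that does not answer.
config = {"mtu": 1500, "dns": "10.0.0.99"}


def is_resolved():
    return config["dns"] == "10.0.0.2"


hypotheses = [
    ("lower the MTU",
     lambda: config.update(mtu=1400),
     lambda: config.update(mtu=1500)),
    ("point at the working resolver",
     lambda: config.update(dns="10.0.0.2"),
     lambda: config.update(dns="10.0.0.99")),
]

confirmed = troubleshoot(hypotheses, is_resolved)
print(confirmed, config)
```

Note that the failed MTU change is rolled back before the next hypothesis is tested, so only one variable changes at a time. That is the difference between this and trying a bunch of random things: when the problem goes away, you know which change did it.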

Are You Googlable?

I’ve decided that if you work in IT and I can’t find you on Google, then you might as well retire.

Stop Using Wizards!

My #1 problem with wizards is that they make people think they are capable of configuring things properly, regardless of whether or not they actually know what the hell they’re doing.  This is also one of the major gripes that I have with companies like Microsoft, who have managed to convince people all over the world that a pretty interface with a bunch of wizards is a good substitute for competence.  Sorry, but that’s bullshit, and every IT professional worth their salt knows this.

For the record, I am not just an elitist who advocates doing everything manually through a command line.  I understand that a wizard can help get you up and running quickly, and I think any wizard that tells you all the things it did would be a great learning tool.  However, I have yet to encounter a wizard that tells you much (if anything) about what it’s doing, and nobody is going to convince me that speed of implementation is more important than knowing how to configure something so that you can fix it when it breaks.

The bottom line is that if you feel the need to use the wizard (especially for critical security infrastructure like firewalls), then you have no business using it, because you obviously don’t know what you’re doing.