Archive for the 'Rants' Category

A Scientific Method for Troubleshooting

I think there’s a new heavyweight champion in the world of Mike’s IT-related pet peeves. It’s called “just trying a bunch of random things until my problem goes away.” This is in no way related to troubleshooting, which I will define as “uncovering the root cause of an issue and then resolving it deliberately.” While there are many different techniques for troubleshooting specific problems, I’m going to attempt to show how the scientific method can provide a common framework for more effective troubleshooting.

Step 1: Describe the problem

This is the easy part. Find out what the problem is and reproduce it. That 2nd part is important, because the typical user can’t always be trusted to know what they’re doing. By reproducing the problem, you can rule out user error and verify that there actually is a problem.

Step 2: Gather and analyze data

This is the part everyone likes to skip, and is the real subject of this rant. Step two requires direct observations in order to find out exactly what is happening. How you do this is very much dependent on the problem at hand, but it involves things like log and packet analysis (which may very well require a 3rd party tool not included within the base OS). In any case, please don’t just take a wild guess about what’s causing the problem and then proceed down the path of random experimentation. I don’t know for sure how people acquire this bad habit, but I have a strong suspicion that it comes from working in Microsoft Land, where the computers have personalities, rebooting is the inexplicable fix for everything, and a sea of GUIs makes it easy for the novice sysadmin to miss what’s really going on.

Part of the problem in Microsoft Land is that proper troubleshooting tends to be a lot more difficult than it needs to be. For example, there is a severe lack of useful diagnostic tools included within the Windows OS itself. Why are the Windows support tools, resource kit tools, and IIS diagnostic tools still separate downloads? The same question can be asked about the Sysinternals Suite (which Microsoft has owned for several years now). Why are they still shipping obsolete utilities instead of their newer replacements (e.g. nslookup, which was obsoleted by dig many, many years ago)? And lastly, why does Microsoft constantly try to hide any information that could be useful for troubleshooting? Anyone who has ever had to view the message headers on an email in Outlook knows exactly what I’m talking about here, but I digress.

Step 3: Form a hypothesis

It isn’t until you figure out what’s happening that you can address the question of why it’s happening. Step three is where you use the information you gathered in step two to determine a logical course of action. Remember that a hypothesis beginning with “maybe” or “I think” with little or no direct evidence to back it up is often a dead giveaway for someone who doesn’t know what the hell they’re talking about.

Step 4: Test your hypothesis

Perform your planned course of action.

Step 5: Analyze results and draw conclusions

Check to see if the issue is resolved. If not, revert your changes and go back to step three. When drawing conclusions, ask yourself how this problem occurred in the first place. Was your most recent fix permanent or just a temporary band-aid? If the fix was temporary, make sure you schedule a time to implement a permanent fix.
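
The loop formed by steps three through five can be sketched in a few lines of Python. This is only an illustration under assumptions: `Hypothesis`, `troubleshoot`, and the `apply`, `revert`, and `problem_reproduces` callables are hypothetical names made up for this sketch, not part of any tool mentioned in this post.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Hypothesis:
    """A candidate root cause plus the change that would address it (hypothetical type)."""
    description: str
    apply: Callable[[], None]    # step 4: perform the planned course of action
    revert: Callable[[], None]   # step 5: undo the change if it did not help


def troubleshoot(
    problem_reproduces: Callable[[], bool],
    hypotheses: Iterable[Hypothesis],
) -> Optional[Hypothesis]:
    """Test hypotheses one at a time; return the one that resolved the issue, if any."""
    # Step 1: confirm there actually is a problem before changing anything.
    if not problem_reproduces():
        return None
    for hypothesis in hypotheses:
        hypothesis.apply()
        if not problem_reproduces():
            return hypothesis      # resolved: now ask whether the fix is permanent
        hypothesis.revert()        # not resolved: revert and go back to step three
    return None
```

The point of the sketch is the `revert()` call: every failed attempt is undone before the next one starts, so the system never accumulates a pile of random changes. The hypotheses themselves are assumed to come from the data gathered in step two, not from guesswork.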

Are You Googlable?

I’ve decided that if you work in IT and I can’t find you on Google, then you might as well retire.

Stop Using Wizards!

My #1 problem with wizards is that they make people think they are capable of configuring things properly, regardless of whether or not they actually know what the hell they’re doing.  This is also one of the major gripes that I have with companies like Microsoft, who have managed to convince people all over the world that a pretty interface with a bunch of wizards is a good substitute for competence.  Sorry, but that’s bullshit, and every IT professional worth their salt knows this.

For the record, I am not just an elitist who advocates doing everything manually through a command line.  I understand that a wizard can help get you up and running quickly, and I think any wizard that tells you all the things it did would be a great learning tool.  However, I have yet to encounter a wizard that tells you much (if anything) about what it’s doing, and nobody is going to convince me that speed of implementation is more important than knowing how to configure something so that you can fix it when it breaks.

The bottom line is that if you feel the need to use the wizard (especially for critical security infrastructure like firewalls), then you have no business using it, because you obviously don’t know what you’re doing.

Kaseya Monitoring Sucks

I have been using Kaseya for over six months now, and even after the recent update to version 5.0, the network monitoring functionality remains pretty much a joke. I don’t understand how this app became so popular in the MSP world. Here’s a list of reasons why the monitoring falls short:

  • No concept of “state” for services – Kaseya keeps track of host state (i.e. online/offline) with a nice little icon, but it does not do anything similar for services. Therefore, there is no way to get a complete picture of a network’s current state “at a glance.” Services should not be treated like second-class citizens this way, because plenty of services are just as critical as (if not more than) the actual hosts they are running on.
  • No alerts when a service returns to an “OK” state – It’s not enough to receive an alert just when a host/service check fails. You need a corresponding alert when it starts succeeding again too, and here’s why: False alarms are pretty much a fact of life with all monitoring applications. There are lots of reasons why a check might fail every now and then (network congestion, etc.). Therefore, I don’t want to have to scramble every time I see an alert just to find out that the service is already back up again. I want the monitoring application to tell me when it’s back up. This is especially important when I don’t have immediate access to a computer (e.g. if I’m on the road and I have alerts going to my phone).
  • Graphs are limited to ~2000 data points – This is the problem with storing raw performance data in a SQL database (I admit that I am simply assuming that’s what they have done here). More data makes a slower database, so it makes sense that they would hard-code a limit on the amount of data that can be stored. But how in the hell do they expect their customers to do trend analysis with only ~2000 data points? Assuming a service check every 10 minutes: 60 minutes * 24 hours / 10 = 144 data points per day. ~2000 data points total / 144 per day = a history of ~14 days. Compare this to something like Cacti, which can easily handle years’ worth of performance data using RRDtool.
  • SNMP monitoring will bring your network to its knees – Amazingly, there is no way to modify the polling interval for SNMP queries. It appears that the Kaseya agent will simply execute snmpget in rapid succession until you notice that your server’s CPU is pinned to 100% and you are forced to disable SNMP monitoring altogether. Even if they increased the hard-coded polling interval, you still have no control over the size of your graphs due to the data point limit listed above (in my experience, you might get 1-2 days if you’re lucky). What baffles me the most is that you can set the polling interval for WMI monitoring. So why would they not implement this for SNMP too? I don’t get it.
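
Two of the points above are easy to sketch in Python: the retention arithmetic behind the data point limit, and the idea of alerting on every state change (including the return to “OK”) rather than only on failures. This is a minimal sketch under stated assumptions, not how Kaseya or any other product works internally. The names `retention_days` and `state_change_alerts` are made up for illustration, and the sketch assumes every service starts out in an OK state.

```python
from typing import Dict, Iterable, List, Tuple


def retention_days(max_points: int, check_interval_minutes: float) -> float:
    """Days of history that fit in a fixed number of data points."""
    points_per_day = 24 * 60 / check_interval_minutes
    return max_points / points_per_day


def state_change_alerts(results: Iterable[Tuple[str, bool]]) -> List[str]:
    """Emit one alert per state transition, including recovery to OK.

    `results` is a time-ordered sequence of (service_name, check_passed) pairs.
    """
    last_state: Dict[str, bool] = {}
    alerts: List[str] = []
    for service, ok in results:
        previous = last_state.get(service, True)  # assumption: services start out OK
        if ok != previous:
            # Only transitions produce alerts; repeated failures stay quiet.
            alerts.append(f"{service}: {'OK' if ok else 'CRITICAL'}")
        last_state[service] = ok
    return alerts
```

With the numbers from the list, `retention_days(2000, 10)` comes out just under 14 days. Feeding `state_change_alerts` a failing check followed by a passing one for the same service yields two alerts, one for the failure and one for the recovery, which is the behavior I want when alerts are going to my phone.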

To be fair, Kaseya seems to be paying a lot of attention to the user complaints on their forums lately, and they’ve already started to address some of these problems. Hopefully I’ll have less to complain about in the coming weeks.

Update: Due to all the Kaseya bashing going on in the comments, I just want to make it clear that I like Kaseya as a Windows desktop and server management tool. My only major complaints are with the way it does monitoring (hence, the title of this post).

Windows 2008 Telnet (not SSH) Server

Have you heard that Windows 2008 will be able to run in a command-line only mode, but will continue to ship with a telnet server instead of SSH? This is awesome, seeing as how telnet is an insecure, antiquated method of remote access that should not be used by anyone under any circumstances. Congratulations Microsoft! Welcome to the 1970’s! Should we expect the SSH server in Windows Server 2033?

Seriously, what the fuck are those people doing over there?

Update: According to Microsoft, there will be “a technology like this included in Windows Server 2008 called WinRS; or Windows Remote Shell. This command line tool allows administrators to remotely execute most cmd.exe commands using the WS_Management protocol.” Too bad it sucks!

See Also: “Not Invented Here Syndrome”