It's All About the (Disk) Platters...
So, I was brought into yet another high-key, holy-cow, we-just-have-to-fix-this, something's-wrong-with-your-product session a few days ago. After looking at the symptoms, my question was simple - "Have you checked disk performance?" When the answer came back negative, my immediate response was, "enable statistics, run Perfmon (this was a Windows platform), and retest." Well, the results hit my chat window yesterday:
Disk queue lengths are often misunderstood, so allow me a moment to explain. This metric simply reports the number of I/O requests (call them transactions, if you like) that are waiting on disk input/output. There's no direct correlation to the SIZE of each request; you could have 5 requests awaiting 1 KB each, 5 requests awaiting 15 MB each, or anything in between. The important thing to keep in mind is that, back at the application layer, there's probably an executing thread behind each of those queued requests. In all likelihood, there are other threads waiting on those "pending disk I/O" threads...and you see where bottlenecks at this layer can create significant performance problems. In this example, the worst-case scenario (the peak disk queue length) showed over 4500 requests awaiting disk I/O. In a word, ecch.
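To drive the point home, here's a minimal sketch (names and sizes are made up for illustration) showing that the queue length metric is a pure count of outstanding requests - the two workloads below differ by four orders of magnitude in pending bytes, yet report the identical queue length:

```python
# Illustrative sketch: disk queue length counts pending requests, not bytes.
from collections import namedtuple

Request = namedtuple("Request", ["thread_id", "size_bytes"])

# Two very different pending workloads (hypothetical numbers from the text)
small_io = [Request(i, 1 * 1024) for i in range(5)]          # 5 requests x 1 KB
large_io = [Request(i, 15 * 1024 * 1024) for i in range(5)]  # 5 requests x 15 MB

def queue_length(pending):
    """The metric is simply a count of outstanding I/O requests."""
    return len(pending)

print(queue_length(small_io))   # 5
print(queue_length(large_io))   # 5 -- same queue length, wildly different byte totals
```

And remember: behind each of those five queued requests, there's likely an application thread blocked, plus other threads blocked behind *those* threads.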
For the nth time, folks, it's all about disk I/O AS PERCEIVED BY THE OPERATING SYSTEM AND APPLICATIONS! This application happened to be running on a very high-powered VMware virtual machine, with an equally high-powered SAN providing storage. When they looked at overall disk performance between the SAN and the VMware server, they saw "high throughput" and "low latency"; however, viewing the question from the perspective of the individual guest OS presented a markedly different result (and a definite "red flag").
Every OS/hardware vendor agrees that sustained average disk queue lengths in excess of 2.0 (per spindle) indicate I/O bottlenecks. I don't care what your SAN folks say, and I don't care what your virtualization folks say - check it out for yourself. At the OS layer, you can check these metrics with Perfmon's "Avg. Disk Queue Length" counter (Windows) or iostat's extended statistics (Unix/Linux). If your average queue length is higher than 2.0 over a substantial period of time, you're suffering from disk bottlenecks. If you see peaks above 2.0 during particular activities (e.g. backups, particular server tasks, scheduled operations, etc.), then you're taking a performance hit there as well. This needs to be a part of every admin's "general health" dashboard - get on it!
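If you want to bake this check into a dashboard or cron job, here's a rough sketch that scans `iostat -x`-style output for devices exceeding the 2.0 threshold. The sample text is made-up illustrative data, not real measurements, and real iostat column layouts vary by sysstat version - so treat the parsing as a starting point, not gospel:

```python
# Sketch: flag devices whose avgqu-sz (average queue length) exceeds a threshold.
# SAMPLE below is fabricated example output in the classic `iostat -x` layout.
SAMPLE = """\
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.20 10.0 5.0 100.0 50.0 12.0 0.85 3.2 1.1 10.5
sdb 0.00 0.00 200.0 150.0 4000.0 3000.0 16.0 7.40 45.8 2.9 99.7
"""

def flagged_devices(text, threshold=2.0):
    """Return (device, queue_length) pairs for devices over the threshold."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    qcol = header.index("avgqu-sz")      # locate the queue-length column by name
    flagged = []
    for line in lines[1:]:
        fields = line.split()
        device, qlen = fields[0], float(fields[qcol])
        if qlen > threshold:
            flagged.append((device, qlen))
    return flagged

print(flagged_devices(SAMPLE))  # [('sdb', 7.4)] -- sdb is the red flag here
```

In this (invented) sample, `sda` is healthy while `sdb` is queuing badly - exactly the kind of per-host view that the SAN-level dashboards in the story above were hiding.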