I have been migrating some tier 1 workloads over to VMware in the last 3 months or so. While doing so we ended up having to debug some performance issues, some related to the number of vCPUs available to the operating system and some related to storage.
We started by debugging the vCPU based issues. Our application administrators spent loads and loads of time and effort on getting those issues fixed. Most of them came down to settings in the application itself, and once those had been modified to reflect the actual number of vCPUs available to the VMs running those workloads, things mostly got back to normal.
However, we also had some performance issues with our storage. While we were running those applications on the old hardware things seemed to run well enough. Nothing was great but nothing was horrible either.
After the migration was done and the vCPU issues were sorted we started looking at those storage related issues. The new hardware was a lot more powerful so it seemed obvious that it could push more throughput on the storage side. We were seeing higher throughput rates (MB/s) but response time was horrible at times.
While chasing some leads I was starting to think that the storage was just becoming the bottleneck. Countless hours were spent analysing graphs from the OS, the arrays and our VMware hosts.
Finally, we noticed that the response times seen on the VMware hosts did not match the response times reported by the array. That led us to think that the issue was either on the host side or somewhere in the network between host and array.
We opened up a ticket with our hardware vendor and got the support team to go over our VMware driver+firmware setup (which was the same as we had been running on countless other hosts without, as far as we knew, any issues).
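To make that comparison concrete, here is a minimal sketch of the kind of check we were effectively doing by eye. It assumes you have already exported matching latency samples (in milliseconds) from the host and from the array for the same intervals; the function names, the sample values and the 5 ms threshold are all made up for illustration and are not from any VMware or array tool.

```python
from statistics import mean


def latency_gap_ms(host_ms, array_ms):
    """Average extra latency seen at the host beyond what the array reports."""
    if len(host_ms) != len(array_ms) or not host_ms:
        raise ValueError("need two equal-length, non-empty sample lists")
    return mean(h - a for h, a in zip(host_ms, array_ms))


def gap_is_suspicious(host_ms, array_ms, threshold_ms=5.0):
    """True when the host sees noticeably more latency than the array does.

    threshold_ms is an arbitrary illustrative cut-off, not a vendor value.
    """
    return latency_gap_ms(host_ms, array_ms) > threshold_ms


# Hypothetical samples: the array looks healthy, the host does not.
host_samples = [10.0, 20.0]
array_samples = [4.0, 6.0]
print(latency_gap_ms(host_samples, array_samples))
print(gap_is_suspicious(host_samples, array_samples))
```

If the array reports low latency but the host sees a large, consistent gap on top of it, the time is being lost somewhere between the two (HBA, driver, firmware, fabric) rather than on the array itself.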
Nothing obvious came out of the support case, but after our monthly “no-change” period finished after the new year we decided to update the HBA driver+HBA firmware.
Finally we got response times down. They finally matched the latency on the array. I kicked myself since I cannot even remember how often I have yelled at other people: “UPGRADE YOUR DRIVERS AND FIRMWARE!”… live’n’learn people! 🙂
The moral of the story: Check and see if you are running the latest supported firmware+driver versions on your hosts before spending countless hours analysing performance data when you are debugging some damned performance issues! 😉
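If you want to turn that check into a habit, something like the sketch below could flag hosts that are behind. It assumes you have collected the installed versions yourself and looked up the latest supported ones from your vendor's compatibility list; the component names and version numbers are purely hypothetical, and it only handles plain dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted numeric version like '1.2.10' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def outdated_components(installed, latest_supported):
    """Return the names of components running older than the supported version."""
    outdated = []
    for name, current in installed.items():
        target = latest_supported.get(name)
        # Components with no known supported version are skipped, not flagged.
        if target is not None and parse_version(current) < parse_version(target):
            outdated.append(name)
    return sorted(outdated)


# Hypothetical inventory for one host.
installed = {"hba_driver": "1.2.9", "hba_firmware": "3.0.0"}
latest_supported = {"hba_driver": "1.2.10", "hba_firmware": "3.0.0"}
print(outdated_components(installed, latest_supported))
```

Comparing the versions as tuples of integers rather than as strings matters here, because as plain text "1.2.9" would sort after "1.2.10".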