All posts by finnzi

LPAR2RRD – A nice tool to gather (and watch) your historical VMware performance data

About six months ago I installed LPAR2RRD to graph a VMware environment that holds a few hundred VMs. At the time I ran into some issues and didn’t really use it, but I kept it installed and let it keep gathering data.

However, I installed the newest version a couple of days ago and was rather impressed. Although this small tool does not look like much, it is a simple solution for gathering performance data and helps you debug performance issues in your VMware environment. I have never used the LPAR (AIX) functionality, but that part seems even more complete.

The errors I kept getting back then have been cleaned up (I never really spent any time debugging them after the initial installation); they mostly showed up when looking at cluster data, and I also had some issues with the datastore graphs.

I sound like a bloody advertisement, but I have no connection to the company working on the product other than being a (happy) user 🙂

Bgrds,
Finnur

3PAR support in STOR2RRD

Hello!

I have been a user of STOR2RRD since ~2012, I think. It is a brilliant tool for gathering historical performance information on your IBM SAN Volume Controller based storage (Storwize, SVC, etc.).

However, I was checking for a new version a few days ago and saw that they have added support for HP 3PAR. They have also added support for some HDS storage as well as NetApp. So now, if you cannot cough up the dough for [insert your best-of-breed storage management tool here], you might have a cool alternative here 🙂

Of course, this is no replacement for IBM’s TPC products, but if you are looking for a single tool to gather performance statistics for your storage arrays and your Brocade fabrics, look no further. It is a very valuable tool for daily operations.

Check out their webpage here.

Bgrds,
Finnur

VMware HBA driver issues

Hi,

Over the last three months or so I have been migrating some tier 1 workloads over to VMware. While doing so, we ended up having to debug performance issues related both to the number of vCPUs available to the operating system and to storage performance.

We started with the vCPU-based issues. Our application administrators spent loads and loads of time and effort getting those fixed. Most of them came down to settings in the application itself, and after those had been adjusted to match the actual number of vCPUs available to the VMs running the workloads, things mostly got back to normal.

However, we also had some performance issues with our storage. While we were running those applications on the old hardware, things seemed to run well enough; nothing was great, but nothing was horrible either.

After the migration was done and the vCPU issues were sorted, we started looking at the storage-related issues. The new hardware was a lot more powerful, so it seemed obvious that it could push more throughput on the storage side. We were indeed seeing higher throughput rates (MB/s), but response times were horrible at times.

While chasing leads, I started to think that the storage itself was simply becoming the bottleneck. Countless hours were spent analysing graphs from the OS, the arrays and our VMware hosts.

Finally, we noticed that the response times reported by the VMware hosts did not match the response times on the array. That led us to think that the issue was either on the host side or somewhere in the network.

We opened up a ticket with our hardware vendor and got the support team to go over our VMware driver and firmware setup (the same combination we had been running on countless other hosts, seemingly without issue).

Nothing obvious came out of the support case, but after our monthly “no-change” period ended after the new year, we decided to update the HBA driver and HBA firmware.

*BAM*!

We finally got response times down; they now matched the latency on the array. I kicked myself, since I cannot even remember how often I have yelled at other people: “UPGRADE YOUR DRIVERS AND FIRMWARE!” Live and learn, people! 🙂

The moral of the story: check whether you are running the latest supported firmware and driver versions on your hosts before spending countless hours analysing performance data while debugging some damned performance issue! 😉
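
For what it is worth, checking the driver side of this on an ESXi host is quick. Below is a minimal sketch; qlnativefc is just an example module name, so use whatever driver the first command reports for your HBAs. Firmware versions typically come from the vendor tools or the server’s management controller, and both should be checked against the VMware compatibility guide.

# List storage adapters and note the Driver column for your FC HBAs
esxcli storage core adapter list
# Show the version of that driver module (qlnativefc is a placeholder)
esxcli system module get -m qlnativefc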

Bgrds,
Finnur

On VMware vSphere and driver/firmware issues

Hi,

I have spent the better part of this year planning and preparing to migrate some large databases onto virtual machines running on top of VMware vSphere.

While working through specs and other details, I read loads and loads of forum posts, white papers, guides and anything else I could find on the subject.

During that research I started to find more and more posts mentioning issues with drivers and/or firmware on VMware hosts, and not specific to any one vendor. Of course this worried me somewhat, so I dug into it further.

My conclusion, after reading through a lot of blog posts and speaking to multiple experts on VMware ESXi, was very simple: since larger and larger mission-critical systems are being virtualized, we are pushing the hardware a lot harder than we used to. And when we push the hardware to 70%, 80% or even 100% utilization, flaws that stayed hidden when systems only used 30-40% of the available resources suddenly become a lot more visible.

Just thought I should write this down….especially since I am watching one of my DB hosts pushing its CPU hard! 🙂

Bgrds,
Finnur

Using LVM to migrate between arrays (and raw device mapped LUNs to VMFS backed ones)

Hi,

Recently I have been working on a project that requires me to migrate a few multi-terabyte databases from physical to virtual machines.

Since we were lucky enough that the LUNs for those databases hosted LVM-backed filesystems, I was able to present the LUNs as RDMs to the VMware virtual machines, create new virtual hard disks, and use the magical pvmove command to migrate the data.

The total downtime for each database is around 5-15 minutes, mostly because we have to present the LUNs to the virtual machine, mount the file systems and chown the database files to a new uid/gid. After that is done, the databases are started.
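
Roughly, the cutover inside the guest looks like the sketch below. The volume group name (datavg), the mount point and the oracle:dba owner are made-up examples, so adjust them to whatever your environment actually uses.

# Rescan the guest SCSI buses so the newly presented RDM LUNs show up
for h in /sys/class/scsi_host/host*; do echo "- - -" > "$h/scan"; done
# Detect and activate the migrated volume group, then mount the filesystem
vgscan
vgchange -ay datavg
mount /dev/datavg/oradata_lv /u01/oradata
# The database files need to match the uid/gid of the database owner on the new machine
chown -R oracle:dba /u01/oradata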

Once the database has been verified to work as expected, we create new virtual hard disks, run pvcreate on them and add them to the volume group we are migrating with vgextend.

After that we just fire up a trusty screen session (or tmux or whatever!) and run the mythical command: pvmove -i 10 -v /dev/oldlun /dev/newlun.

When that command finishes, we remove the old LUN from the volume group with vgreduce, run pvremove on it, remove the LUN from the virtual machine (you might want to run echo 1 > /sys/block/lunname/device/delete before you do that), unmap the LUN from the ESXi hosts, and we are done!
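
Put together, the whole dance looks roughly like this. The device names (/dev/sdb for the old RDM LUN, /dev/sdc for the new VMFS-backed disk) and the datavg volume group are placeholders for illustration.

pvcreate /dev/sdc                        # label the new virtual disk as an LVM physical volume
vgextend datavg /dev/sdc                 # add it to the volume group being migrated
pvmove -i 10 -v /dev/sdb /dev/sdc        # move all extents off the old LUN, reporting progress every 10 seconds
vgreduce datavg /dev/sdb                 # drop the now-empty LUN from the volume group
pvremove /dev/sdb                        # wipe its LVM label
echo 1 > /sys/block/sdb/device/delete    # make the guest kernel forget the device before unmapping it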

The biggest reason for us not to stick with RDMs is that the flexibility we get from native virtual disks pretty much cancels out any performance gains we might get (with emphasis on might) from RDMs (although I have yet to see any performance loss from using VMFS). And when we finally make the jump to vSphere 6.x, I can migrate those virtual disks straight to VVols.

The only sad thing in our case is that with this method we are stuck on EXT3, since the file systems are migrated over from old RHEL5 machines. I’m not sure I want to recommend that anyone run a migration from EXT3 to EXT4 on 6-16 TB file systems 😀 (at the very least, make sure you have a full backup available before testing this!).

Bgrds,
Finnur

My faith in vendor support has been restored!

Hi,

Recently I had to contact support at two different vendors that we have started relying on more over the last year or so.

I have had my share of dealing with support at large enterprise software and hardware vendors, and my share of “uhh…have you turned it off and on again” with a $50,000 server, which I was not so willing to reboot just so a first-level support agent could work through his script (yes, I am an evil customer!).

So, the first case was with a hardware vendor. I had mentally prepared myself for a fight with first-level support. I opened a case, gave them as much detail as I could and went on with my day.
In about 30 minutes someone contacted me (this was not a system-down issue, so I just opened a “normal” ticket) and gave me access to a site where I could upload the relevant hardware logs. Around 20 minutes later I got a response from a very knowledgeable person who gave me a solution. Case closed in less than two hours.

The other recent case came up while I was installing and configuring new machines to host our databases. I had a question I needed answered so I could finish building out the master that I would duplicate to create our new fancy database virtual machines.
My experience with our previous OS vendor for the database servers had been horrible: slow responses, and pretty much every ticket got the classic “have you tried turning it off and on again”. So a case was opened where I laid out my question. Again, I prepared myself for loads and loads of script-based answers and pretty much gave up any hope of getting a real answer.
Well, to say the least, I got a reply in about two hours asking for some more info, and after providing it I had a well-backed answer from a very knowledgeable person within another 20-30 minutes. This was a support ticket with a very low priority.

My faith in support has been restored. HP and Oracle, keep up the good work!

Bgrds,
Finnur

What is the single most important thing Oracle VM is missing?

I have been going through the Oracle VM feature set.

As a virtualization solution it actually looks pretty good. The cost is at the lower end, and it seems fairly feature-complete.

But there is one thing missing…one huge feature.

VM snapshot based backups. I’m betting that if they added an API for taking snapshot-based backups and made a deal with Veeam to support it, quite a lot of system administrators would take a harder look at using Oracle VM for at least some projects (i.e. virtualizing Oracle applications).

From my standpoint this is the biggest issue. Oracle has already gotten servers from multiple vendors certified (on the HCL), and it seems that Oracle is playing nicely with the larger hardware vendors (IBM/Lenovo, HP, Cisco).

Oracle – your techs might want to take a hard look at this feature – this will actually help you guys gain larger market share in data center virtualization!

Just my two cents!

Bgrds,
Finnur

3PAR StoreServ 7000 – Peer Persistence links

Howdy all,

I have been testing a 3PAR Peer Persistence setup using two 3PAR StoreServ 7200c arrays, dual interconnected fabrics between the sites and a multi-site VMware cluster (although with only one node per site).

It works flawlessly!

Being able to take an array offline disruptively (by removing power to the controller shelf) and have the only visible effect on the virtual machines be a roughly 10-second “delay” while Peer Persistence fails the VMFS volumes over is pretty awesome.
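
A handy way to watch this from the ESXi side is to look at the multipathing state of one of the protected devices before and after pulling the plug; the active paths should flip over to the surviving array. The naa identifier below is just a placeholder.

# Show the multipathing details and per-path state for a Peer Persistence protected device
esxcli storage nmp device list -d naa.60002ac0000000000000000000012345
esxcli storage nmp path list -d naa.60002ac0000000000000000000012345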

We are still missing VVol support for replicated volumes (and vMSC), but hopefully it will come later this year.

Here are the most important links on Peer Persistence:
VMware.com (KB article 2055904)
HP’s own Implementing vSphere Metro Storage Cluster using HP 3PAR Peer Persistence

And as an added bonus – HP 3PAR SSMC 2.1 makes Peer Persistence configuration easy as 1-2-3 by Techazine.com

The whole thing takes about an hour to configure when you know what you are doing, and adding a new volume to the Peer Persistence configuration is a snap.
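
For reference, adding a volume boils down to something like the sketch below on the 3PAR CLI. The CPG, volume, Remote Copy group, target and host names are all made up, and I am writing the options from memory, so double-check them against the HP documentation linked above before running anything.

createvv SSD_CPG pp_vol01 500g                    # create the new virtual volume on the primary array
stoprcopygroup pp_group01                         # the Remote Copy group usually has to be stopped before admitting a new volume
admitrcopyvv pp_vol01 pp_group01 3par2:pp_vol01   # add the volume to the Peer Persistence protected group
startrcopygroup pp_group01                        # start replication again
createvlun pp_vol01 10 esx-cluster01              # export the volume to the ESXi hosts (repeat on the secondary array for the remote copy)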

Bgrds,
FOG