Measuring Virtual Networking Overhead

After discussing [ovs-discuss] ovs performance on ‘worst case scenario’ with ovs-vswitchd up to 100%.  One of my colleagues had a good idea: tcpdump the physical interface and the vif at the same time. The difference between when the packet reaches the vif and the packet reaches the physical device can help measure the amount of time in a userspace->kernelspace transit. Of course, virtual switches aren’t the only culprit in virtual networking overhead- virtual networking is a very complex topic.

I created a new tool to help measure this overhead for certain traffic patterns: netweaver. There’s lots of info in the README, so head on by!

NetWeaver does the following:

  • Retrieve the vif details from the hypervisor
  • Start a traffic generating command on source instance(s)
  • Gather packet capture from destination instance’s hypervisor
  • Analyze the packet captures from the vif and eth devices
  • Perform some basic statistical analysis (average, max, min, stdev) on the result set

I intend on using this for analyzing various configurations with Xen, guest OSes, and Open vSwitch.

iSCSI SAN performance woes with VMware ESX 3.5

We filed support requests with IBM and VMware and went through a very lengthy process without any results.

Each of our hosts had the following iSCSI HBAs:

  • QLA4010
  • QLA4050C

A while ago we found out QLA4010 is not on the ESX 3.5 HCL even though it runs with a legacy driver.

As our virtual environment grew we noticed storage performance lagging. This was particularly evident with our Oracle 10G Database server running our staging instance of Banner Operational Data Store. We were seeing 1.1 MB/sec and slower for disk writes.

We opened a case with VMware support and later with IBM support.  We provided lots of data to VMware and IBM while no one mentioned the unsupported HBA. No one at IBM mentioned it either. VMware support referred us to KB# 1006821 to test virtual machine storage I/O performance.

We ran HD Speed in a new VM mimicing the setup using RDM and using a dedicated LUN. Similar results.
We ran HD Speed on the same RDM on a physical machine and got 45 MB/sec.

All of our hosts had an entry like this in the logs (grep -i abort /var/log/vmkernel* | less)

vmkernel.36:Mon DD HH:ii:ss vmkernel: 29:02:31:16.863 cpu3:1061)LinSCSI: 3201: Abort failed for cmd with serial=541442, status=bad0001, retval=bad0001

Hundreds, if not thousands of these iSCSI aborts in the log files. We punted to IBM and they gave us the recommendation of running Host Utilities Kit. This optimizes HBA settings specific to IBM storage systems.

My recommendation ended up being two fold: Upgrade the ESX hosts because we were on an old build (95xxx) and replace the QLA4010 with a QLA4050C on each host.

Now that our ESX upgrade is complete we are seeing much better performance from our iSCSI storage.