Giving FlowBAT a try

FlowBAT is a graphical, flow-based analysis tool.

It wraps the flow collection software SiLK in a very snappy UI. My interest in FlowBAT and SiLK centers on using Open vSwitch to push sFlow data into SiLK and then using FlowBAT to quickly query that data. Hopefully I’ll get around to posting about OVS, sFlow, SiLK and FlowBAT.
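
If you want to try the OVS side of this, pointing a bridge at an sFlow collector is a single ovs-vsctl call. A minimal sketch, where br0, eth0, and 192.0.2.10:6343 are placeholders for your bridge, agent interface, and collector (e.g., the SiLK box), and the sampling/polling values are just starting points:

# Placeholder bridge, agent interface, and collector address -- adjust for your environment
ovs-vsctl -- --id=@sflow create sflow agent=eth0 \
    target='"192.0.2.10:6343"' header=128 sampling=64 polling=10 \
    -- set bridge br0 sflow=@sflow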

While I was getting started with FlowBAT (and SiLK), I noticed there weren’t any Ansible roles for either component. The FlowBAT team maintains some shell scripts to install and configure FlowBAT, and the SiLK project has SiLK on a box. I ported both of these projects to Ansible roles and posted them to Ansible Galaxy.

To give them both a try in a Vagrant environment, check out vagrant-flowbat.

Troubleshooting LLDP

Why

LLDP is a wonderful protocol which paints a picture of datacenter topology. lldpd is a daemon that runs on your servers, receives LLDP frames, and outputs network location information and more. There’s also a recently patched lldp Ansible module.

Like all tools, LLDP/lldpd has had some issues. Here are the ones I’ve seen in practice, with diagnosis and resolution:

Switch isn’t configured to send LLDP frames

Diagnosing:

tcpdump -i eth0 -s 1500 -XX -c 1 'ether proto 0x88cc'   # capture one LLDP frame (EtherType 0x88cc) on eth0

Switches that have LLDP enabled send frames every 30 seconds by default. If nothing shows up in the capture above, LLDP most likely needs to be enabled in the switchport’s configuration.

Host isn’t reporting LLDP frames

Generally, this means lldpd isn’t running on the server: the LLDP frames are arriving (per the tcpdump above), but lldpctl returns nothing.

Diagnosing:

lldpctl # returns nothing
pgrep -f lldpd # returns nothing
service lldpd restart

Be sure that the lldpd service is set to run at boot and take a look at configuration options.
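
For example, depending on the distribution’s init system (assuming the service is named lldpd):

chkconfig lldpd on       # SysV-style distros (e.g., CentOS 6)
systemctl enable lldpd   # systemd-based distros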

NIC is dropping LLDP frames

By far the most frustrating: NIC firmware issues can cause the NIC to drop LLDP frames (Page 10, item 14).

The way this one manifests:

  • lldpctl reports nothing
  • lldpd is running
  • switch is configured to send LLDP frames

Diagnosing:

Run a packet capture on the switch to ensure that the LLDP frames are being sent to the port. If you can see the frames go out on the wire and traffic to the host is otherwise functioning normally, the problem lies with the NIC.
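
A quick host-side check, assuming the interface is eth0 and ethtool is installed, is to note the NIC’s driver and firmware version and compare it against the vendor’s release notes:

ethtool -i eth0   # reports driver, driver version, and firmware-version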

The fix here was to apply the NIC firmware upgrade; after that, LLDP was good to go!

SDN Development Environment

Recently, I began a deeper dive into SDN and OpenFlow. Overall, I was very happy with the process and the quality of the material out there for newcomers.

However, I noticed a gap when I hit my first stumbling block. I had set up a Mininet instance and noticed it was running Open vSwitch (OVS) v2.0. I needed a newer version of OVS, and I turfed the Mininet instance while upgrading OVS. It quickly became apparent that I needed a repeatable development environment setup.

I created ansible-sdn-dev to help out with this problem.

ansible-sdn-dev includes Ansible roles to build, install and configure these applications:

ansible-sdn-dev also includes a Vagrantfile so you can clone the repository, issue vagrant up and start hacking!
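
Something like the following should get you going; the clone URL is an assumption, so adjust it to wherever the repository actually lives:

# Clone URL assumed -- adjust as needed
git clone https://github.com/andyhky/ansible-sdn-dev.git
cd ansible-sdn-dev
vagrant up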

Upgrading Open vSwitch

Operating Open vSwitch brings a new set of challenges.

One of those challenges is managing Open vSwitch itself and making sure you’re up to date with performance and stability fixes. For example, in late 2013 there were significant performance improvements with the release of 1.11 (flow wildcarding!) and in the 2.x series there are even more improvements coming.

This means everyone running those old versions of OVS (I’m looking at you, <=1.6) should upgrade and get these huge performance gains.

There are a few things to be aware of when upgrading OVS:

  1. Reloading the kernel module is a data-plane-impacting event. The impact is minimal: most workloads won’t notice, and the ones that do will only see a quick blip. The duration of the interruption is a function of the number of ports and flows present before the upgrade.
  2. Along those lines, if you orchestrate OVS kernel module reloads with parallel-ssh, Ansible, or any other tool, be mindful of connection timeouts. All traffic on the host will be momentarily dropped, including your SSH connection! Set your SSH timeouts appropriately or bad things happen (see the example after this list).
  3. Pay very close attention to kernel upgrades and OVS kernel module upgrades. Failure to do so could mean your host networking does not survive a reboot!
  4. Any changes you’ve made outside of OVS/OVSDB to objects OVS manages, e.g., manual setup of tc buckets, will be destroyed.
  5. If you use XenServer, by upgrading OVS beyond what’s delivered from Citrix directly, you’re likely unsupported.
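
As a hedge against #2, you can raise the connection timeout and serialize the run. A rough sketch, where upgrade-ovs.yml is a placeholder name for your upgrade playbook:

# --timeout raises the SSH connection timeout so the brief traffic drop from
# the kernel module reload doesn't fail the play; --forks 1 touches one
# hypervisor at a time. upgrade-ovs.yml is a placeholder playbook name.
ansible-playbook upgrade-ovs.yml --timeout 60 --forks 1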

Here is a rough outline of the OVS upgrade process for an individual hypervisor:

  • Obtain Open vSwitch packages
  • Install the Open vSwitch userspace components and kernel module(s) (see #3 and “Where things can really go awry”)
  • Load new Open vSwitch kernel module (/etc/init.d/openvswitch force-kmod-reload)
  • Simplified Ansible Playbook: https://gist.github.com/andyhky/9983421

The INSTALL file provides more detailed upgrade instructions. In the old days, upgrading Open vSwitch meant either rebooting the host or rebuilding all of your flows because of the kernel module replacement. Since the introduction of kernel module reloads, the upgrade process is more durable and less impacting.

Where things can really go awry

If your OS has a new kernel pending, e.g., after a XenServer service pack, you will want to install OVS kernel module packages for both the running kernel and the kernel which will be running after the reboot. Failing to do so can result in losing connectivity to your machine.
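
A quick manual spot-check (the script below automates this) is to compare the installed module packages against both kernels:

rpm -qa 'openvswitch-modules-xen*'   # OVS module packages installed
uname -r                             # kernel running right now
readlink /boot/vmlinuz-2.6-xen       # kernel that will boot next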

A mismatch between the Open vSwitch kernel module and the running Xen kernel doesn’t guarantee a loss of networking, but it is a best practice to keep them in lock-step. The cases I’ve seen go wrong usually involved significant version changes, e.g., 1.6 -> 1.11.

You can check if you’re likely to have a problem by running this code (XenServer only, apologies for quick & dirty bash):

#!/usr/bin/env bash
# Compare the running and pending (post-reboot) Xen kernels against the
# installed openvswitch-modules RPMs to catch mismatches before a reboot.
RUNNING_XEN_KERNEL=`uname -r | sed s/xen//`
PENDING_XEN_KERNEL=`readlink /boot/vmlinuz-2.6-xen  | sed s/xen// | sed s/vmlinuz-//`
OVS_BUILD=`/etc/init.d/openvswitch version | grep ovs-vswitchd | awk '{print $NF}'`

# Running kernel vs. installed OVS modules
rpm -q openvswitch-modules-xen-$RUNNING_XEN_KERNEL-$OVS_BUILD > /dev/null
if [[ $? == 0 ]]
then
    echo "Current kernel and OVS modules match"
else
    CURRENT_MISMATCH=1
    echo "Current kernel and OVS modules do not match"
fi

# Pending (post-reboot) kernel vs. installed OVS modules
rpm -q openvswitch-modules-xen-$PENDING_XEN_KERNEL-$OVS_BUILD > /dev/null
if [[ $? == 0 ]]
then
    echo "Pending kernel and OVS modules match"
else
    PENDING_MISMATCH=1
    echo "Pending kernel and OVS will not match after reboot. This can cause system instability."
fi

if [[ $CURRENT_MISMATCH == 1 || $PENDING_MISMATCH == 1 ]]
then
    exit 1
fi

Luckily, this can be rolled back. Access the host via DRAC/iLO and roll back the vmlinuz-2.6-xen symlink in /boot to one that matches your installed openvswitch-modules RPM. I made a quick and dirty bash script which can roll back, but it won’t be too useful unless you put the script on the server beforehand. Here it is (again, XenServer only):

#!/usr/bin/env bash
# Not guaranteed to work. YMMV and all that.
# Kernel versions that have an openvswitch-modules RPM installed
OVS_KERNEL_MODULES=`rpm -qa 'openvswitch-modules-xen*' | sed s/openvswitch-modules-xen-// | cut -d "-" -f1,2;`
# Kernel versions with a real vmlinuz in /boot (symlinks excluded)
XEN_KERNELS=`find /boot -name "vmlinuz*xen" \! -type l -exec ls -ld {} + | awk '{print $NF}'  | cut -d "-" -f2,3 | sed s/xen//`
# A version present in both lists is one we can safely roll back to
COMMON_KERNEL_VERSION=`echo $XEN_KERNELS $OVS_KERNEL_MODULES | tr " " "\n"  | sort | uniq -d`
stat /boot/vmlinuz-${COMMON_KERNEL_VERSION}xen > /dev/null
if [[ $? == 0 ]]
then
    # Re-point the boot symlink at the kernel that matches the OVS modules
    rm /boot/vmlinuz-2.6-xen
    ln -s /boot/vmlinuz-${COMMON_KERNEL_VERSION}xen /boot/vmlinuz-2.6-xen
else
    echo "Unable to find kernel version to roll back to! :(:(:(:("
fi