Upgrading Open vSwitch

Operating Open vSwitch brings a new set of challenges.

One of those challenges is managing Open vSwitch itself and making sure you’re up to date with performance and stability fixes. For example, in late 2013 the 1.11 release brought significant performance improvements (flow wildcarding!), and the 2.x series has even more on the way.

This means everyone running those old versions of OVS (I’m looking at you, <=1.6) should upgrade and get these huge performance gains.

There are a few things to be aware of when upgrading OVS:

  1. Reloading the kernel module is a data plane impacting event, but a minimal one: most workloads won’t notice, and the ones that do only see a quick blip. The duration of the interruption is a function of the number of ports and flows present before the upgrade.
  2. Along those lines, if you orchestrate OVS kernel module reloads with parallel-ssh, Ansible, or any other tool, be mindful of connection timeouts. All traffic on the host will be momentarily dropped, including your SSH connection! Set your SSH timeouts appropriately or bad things happen (see the sketch after this list).
  3. Pay very close attention to kernel upgrades and OVS kernel module upgrades. Failure to do so could mean your host networking does not survive a reboot!
  4. Any changes you’ve made outside of OVS/OVSDB to objects that OVS manages (e.g., manual setup of tc buckets) will be destroyed.
  5. If you use XenServer, by upgrading OVS beyond what’s delivered from Citrix directly, you’re likely unsupported.
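
For example, with parallel-ssh you can raise the per-host timeout well beyond the expected blip. A hedged sketch — the host file name, timeout, and parallelism values are made up:

# Reload the OVS kernel module across hypervisors, allowing 300s per host
# so the momentary traffic drop doesn't abort the run.
pssh -h hypervisors.txt -t 300 -p 10 '/etc/init.d/openvswitch force-kmod-reload'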

Here is a rough outline of the OVS upgrade process for an individual hypervisor:

  • Obtain Open vSwitch packages
  • Install Open vSwitch userspace components, kernel module(s) (see #3 and “Where things can really go awry”)
  • Load new Open vSwitch kernel module (/etc/init.d/openvswitch force-kmod-reload)
  • Simplified Ansible Playbook: https://gist.github.com/andyhky/9983421
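
Condensed into shell form, the sequence looks roughly like this — a sketch, not a drop-in script, assuming RPM-based XenServer packaging with the openvswitch-modules-xen-* naming used in the check script below:

#!/usr/bin/env bash
# Rough per-hypervisor upgrade sketch -- adjust package names/paths to taste.
# Install new userspace packages plus kernel modules for BOTH the running
# kernel and any kernel that will be running after the next reboot (see #3).
rpm -Uvh openvswitch-*.rpm openvswitch-modules-xen-*.rpm

# Reload the kernel module: the brief data plane interruption happens here.
/etc/init.d/openvswitch force-kmod-reload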

The INSTALL file provides more detailed upgrade instructions. In the old days, upgrading Open vSwitch meant either rebooting your host or rebuilding all of your flows because of the kernel module reload. Since the introduction of force-kmod-reload, the upgrade process is more resilient and far less disruptive.

Where things can really go awry

If your OS has a new kernel pending, e.g., after a XenServer service pack, you will want to install the packages for both your running kernel module and the one which will be running after reboot. Failing to do so can result in losing connectivity to your machine.


A mismatch between the Open vSwitch kernel module and the Xen kernel doesn’t guarantee a loss of networking, but it is a best practice to keep them in lock-step. The failures I’ve seen usually involve significant version jumps, e.g., 1.6 -> 1.11.

You can check if you’re likely to have a problem by running this code (XenServer only, apologies for quick & dirty bash):

#!/usr/bin/env bash
RUNNING_XEN_KERNEL=`uname -r | sed s/xen//`
PENDING_XEN_KERNEL=`readlink /boot/vmlinuz-2.6-xen  | sed s/xen// | sed s/vmlinuz-//`
OVS_BUILD=`/etc/init.d/openvswitch version | grep ovs-vswitchd | awk '{print $NF}'`
rpm -q openvswitch-modules-xen-$RUNNING_XEN_KERNEL-$OVS_BUILD > /dev/null
if [[ $? == 0 ]]
then
    echo "Current kernel and OVS modules match"
else
    CURRENT_MISMATCH=1
    echo "Current kernel and OVS modules do not match"
fi

rpm -q openvswitch-modules-xen-$PENDING_XEN_KERNEL-$OVS_BUILD > /dev/null
if [[ $? == 0 ]]
then
    echo "Pending kernel and OVS modules match"
else
    PENDING_MISMATCH=1
    echo "Pending kernel and OVS will not match after reboot. This can cause system instability."
fi

if [[ $CURRENT_MISMATCH == 1 || $PENDING_MISMATCH == 1 ]]
then
    exit 1
fi

Luckily, this can be rolled back. Access the host via DRAC/iLO and roll back the vmlinuz-2.6-xen symlink in /boot to one that matches your installed openvswitch-modules RPM. I made a quick and dirty bash script which can roll back, but it won’t be too useful unless you put the script on the server beforehand. Here it is (again, XenServer only):

#!/usr/bin/env bash
# Not guaranteed to work. YMMV and all that.
OVS_KERNEL_MODULES=`rpm -qa 'openvswitch-modules-xen*' | sed s/openvswitch-modules-xen-// | cut -d "-" -f1,2;`
XEN_KERNELS=`find /boot -name "vmlinuz*xen" \! -type l -exec ls -ld {} + | awk '{print $NF}'  | cut -d "-" -f2,3 | sed s/xen//`
COMMON_KERNEL_VERSION=`echo $XEN_KERNELS $OVS_KERNEL_MODULES | tr " " "\n"  | sort | uniq -d`
stat /boot/vmlinuz-${COMMON_KERNEL_VERSION}xen > /dev/null
if [[ $? == 0 ]]
then
    rm /boot/vmlinuz-2.6-xen
    ln -s /boot/vmlinuz-${COMMON_KERNEL_VERSION}xen /boot/vmlinuz-2.6-xen
else
    echo "Unable to find kernel version to roll back to! :(:(:(:("
fi

StatsD and multiple metrics


Measure all the things! Graphite and StatsD are my weapons of choice. One set of metrics we particularly wanted was the various TCP stats, including TCP retransmit rate. We crafted a Python script to send all of the metrics in a single UDP packet and hit a weird scenario.

The Python script was all ready to roll except that StatsD was only logging one metric. All of the metrics were arriving at the StatsD instance, but only one was being processed.

It turns out support for multiple metrics in a single packet wasn’t always built into StatsD: it was added in 0.4.0 and exists in later versions. Upgrading StatsD fixes the problem.
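
For reference, a multi-metric packet is just newline-delimited metrics in a single datagram. A quick way to reproduce the scenario from the shell — a hedged sketch where the metric names and host are made up, and 8125 is the usual StatsD default port:

# Two counters in one UDP datagram. StatsD < 0.4.0 only processes the first
# line; 0.4.0+ processes both.
printf 'tcp.retransmits:3|c\ntcp.active_opens:12|c' | nc -u -w1 statsd.example.com 8125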

Deep Dive: Retrieving OpenStack Nova Instance Console URLs with XVP and XenAPI/XenServer

This post is a deep dive into what happens in Nova (and where in the code) when a console URL is retrieved via the Nova API for a Nova configuration backed by XVP and XenServer/XenAPI. Hopefully the methods used in Nova’s code will not change much over time, and this guide will remain a good starting point.

Example nova client call:

nova get-vnc-console [uuid] xvpvnc

And the call returns:

+--------+--------------------------------------+
| Type   | Url                                  |
+--------+--------------------------------------+
| xvpvnc | https://URL:PORT/console?token=TOKEN |
+--------+--------------------------------------+

One thing I particularly enjoy about the console URL call in Nova is that it is synchronous and has to reach all the way down to the VM level. Most calls in Nova are asynchronous, so the console call is a wonderful test of your cloud’s plumbing. If the call takes longer than rpc_response_timeout/rpc_cast_timeout (60/30 seconds respectively), a 500 will bubble up to the user.
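
Because it’s synchronous, simply timing the call gives a rough end-to-end latency check of the whole path (same placeholder arguments as above):

# Anything approaching the RPC timeouts above is a sign of plumbing trouble.
time nova get-vnc-console [uuid] xvpvnc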

It helps to understand how XenServer consoles work in general.

  • XVP is an open source project which serves as a proxy to hypervisor console sessions. Interestingly enough, XVP is no longer used in Nova: the underpinnings of console support were changed in vnc-console-cleanup, but the code is still around (console/xvp.py).
  • A XenServer VM has a console attribute associated with it. Console is an object in XenAPI.

This Deep Dive has two major sections:

  1. Generation of the console URL
  2. Accessing the console URL

How is the console URL generated?

[Diagram: console URL generation flow]

 1) nova-api receives and validates the console request, and then makes a request to the compute API.

  • api/openstack/compute/contrib/consoles.py
  • def get_vnc_console

2) The compute API receives the request and does two things: (2a) calls the compute RPC API to gather connection information and (2b) calls the console authentication service.

  • compute/api.py
  • def get_vnc_console

2a) The compute manager receives the RPC call from (2). An authentication token is generated. For XVP consoles, a URL is generated from FLAGS.xvpvncproxy_base_url and the generated token. driver.get_vnc_console is called.

  • compute/manager.py
  • def get_vnc_console

2a1) driver is an abstraction over the configured virt library, xenapi in this case. This just calls vmops’ get_vnc_console. XenAPI information about the instance is retrieved, and the hypervisor-local Xen console URL is generated and returned.

  • virt/xenapi/driver.py
  • def get_vnc_console
  • virt/xenapi/vmops.py
  • def get_vnc_console

2b) Taking the details from (2a1), the consoleauth RPC API is called. The token generated in (2a) is added to memcache with a TTL of CONF.console_token_ttl.

  • consoleauth/manager.py
  • def authorize_console

What happens when the console URL is accessed?

[Diagram: console access flow]

1) The request reaches nova-xvpvncproxy and a call to validate the token is made on the consoleauth RPC API.

  • vnc/xvp_proxy.py
  • def __call__

2) The token in the request is checked against the token from the previous section (2b). Compute’s RPC API is called to validate the console’s port against the token’s port.

  • consoleauth/manager.py
  • def check_token
  • def _validate_token
  • compute/manager.py
  • def validate_console_port

3) nova-xvpvncproxy proxies a connection to the console on the hypervisor.

  • vnc/xvp_proxy.py
  • def proxy_connection

The Host Network Stack

This post is a collection of useful articles and videos about networking on XenServer and Linux.

XenServer

Linux

As you can see, there are many elements to consider when looking into host networking issues for a Linux VM running on XenServer (which is Linux under the covers anyway).

Managing Nagios Configurations

There’s a good talk given by Gabe Westmaas at the Hong Kong OpenStack Summit:

The talk describes what Rackspace monitors in its public cloud OpenStack deployment, how responses are handled, and some of the integration points that are used. I recommend watching it for OpenStack-specific monitoring and a little context around this post.

In this post I am going to discuss how the sausage gets made – how the underlying Nagios configuration is managed.

Some background: We have 3 classes of Nagios servers.

  1. Global – monitors global control plane nodes (e.g., glance-api, nova-api, nova-cells, cell nagios)
  2. Cell – monitors cell control plane nodes, and individual clusters of data plane nodes (e.g., compute nodes/hypervisors)
  3. Mixed – smaller environments – these are a combined cell/global

With Puppet, the Nagios node’s class is determined from its hostname, and then the Nagios install/config puppet module is applied.

The Nagios puppet setup is pretty simple. It performs basic installation and configuration of Nagios along with pulling in a git repository of Nagios config files. The puppet modules/manifests change rarely, but the Nagios configuration itself has to change relatively frequently.

Types of changes to the Nagios configuration:

  1. Systems Lifecycle – normal bulk add/remove of service/host definitions. These are generated with some automation, currently a combination of Ansible and Python scripts which reach into other inventory systems.
  2. Gap Filling – as a result of RCAs or other efforts, gaps in the current monitoring configuration are identified. After the gap is identified, we need to ensure it is fully remediated in all existing datacenters and all new spin ups.
  3. Cosmetics/Tweaking – we perform analytics on our monitoring to prioritize/identify opportunities to automate remediation and/or deep dive into root causes. We have a logster parser running on each Nagios node which sends the what/when/where of alerts to StatsD/Graphite. Toward the analytics effort, we sometimes make changes to give all services more machine-readable names. We also tune monitoring thresholds for services that are too chatty or not chatty enough.

Changes #2 and #3 were the drivers for putting the Nagios configuration files into a single repository. Without a single repository, the en masse changes were cumbersome and didn’t get made. The configuration repository is laid out like this:

  • Shared configurations are stored in a common folder, with a subfolder for each Nagios node class.
  • Service/Host definitions are stored in folders relative to their environments
  • All datacenters/environments are stored within the environments folder

The entire repository is cloned onto the Nagios node, and parts of it are copied and/or symlinked into /etc/nagios3/conf.d/ based on the Nagios node class and the environment.

For example:

  • nagios01.c0001.test.com: nagios class is cell (c0001 in the hostname), environment is test/c0001
  • /etc/nagios3/conf.d/ gets cfg files from the common/cell folder in the config repo
  • environments/test/c0001 is symlinked to /etc/nagios3/conf.d/c0001/
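
Sketched out in shell form for that node (the repository checkout path is hypothetical):

CONFIG_REPO=/opt/nagios-config    # hypothetical clone location

# Shared definitions for this node class (cell) are copied into conf.d.
cp $CONFIG_REPO/common/cell/*.cfg /etc/nagios3/conf.d/

# Environment-specific host/service definitions are symlinked in.
ln -s $CONFIG_REPO/environments/test/c0001 /etc/nagios3/conf.d/c0001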

This setup has been working well for us in production. It lets first responders and engineers make meaningful changes to the monitoring stack at Rackspace faster.

Determining Enabled VLANs from SNMP with Python

Similar to this thread, I wanted to see what VLANs were allowed for a trunked port as reported by SNMP with Python.

With the help of a couple of colleagues, I made some progress.

>>> vlan_value = '000000000020000000000000000000000000200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
>>> for key,value in enumerate(format(int(vlan_value, 16), "0100b").rjust(len(vlan_value) * 4, '0')):
...     if value == '1':
...         print key
...
42
146
  • Interpret the string returned by SNMP as hex
  • Convert that to binary
  • Left-pad with 0s to the appropriate length (determined by the size of the string) so each bit position corresponds to a VLAN ID
  • Loop through the resulting bits; each position holding a 1 is a VLAN enabled on the port

In conjunction with LLDP, I’m able to query which switch and port each interface is connected to and determine whether the VLANs are set properly on the port.

Personal Backups with Duply


A month or two ago I finally went through all the old hard drives I’ve accumulated over the past decade. I mounted each of the disks and moved a bunch of files onto my desktop’s drive. There were lots of photos from the drives that I don’t want to lose so I decided to get a little more serious about backups.

I decided to give Duply a go. Duply is a wrapper for duplicity, which under the hood uses the tried and trusted rsync algorithm (via librsync).

  • Multiple Locations - I have duply configured to send various data to a USB Drive, Swift (Rackspace Cloud Files), and Another Server. These are easily configured with the .duply/backup scheme. 
  • Encrypted - duply works with GPG encryption
  • Customizable - duply has pre/post hooks which I leverage for notifications on backup success/failures 
  • Efficient - duply is capable of doing incremental backups and using compression

I’ve been really happy with how easy it is to test restores with duply as well.
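
Day to day it boils down to a handful of commands — a hedged sketch, assuming a profile named backup as above:

duply backup backup                 # run a full or incremental backup
duply backup status                 # show backup chains and sets
duply backup restore /tmp/restore   # restore the latest backup for testing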

An example process that I have set up is as follows:

On my desktop system’s power resume, run an incremental backup to Swift. Notify on start and finish of the backups.

It required a little bit of Python and Bash to accomplish this, but I’m happy with the end result. The scripts I used are published to GitHub under andyhky/duply-scripts. Getting started and installation instructions are in the README.


Network wiring with XenServer and Open vSwitch

In the physical world when you power on a server it’s already cabled (hopefully).

With VMs things are a bit different. Here’s the sequence of events when a VM is started in Nova and what happens on XenServer to wire it up with Open vSwitch.

[Diagram: VM start sequence]

  1. nova-compute starts the VM via XenAPI
  2. XenAPI VM.start creates a domain and creates the VM’s vifs on the hypervisor
  3. The Linux userspace device manager (udev) receives this event, and scripts within /etc/udev/rules.d are fired in lexical order
  4. Xen’s vif plug script is fired, which at a minimum creates a port on the relevant virtual switch
    • Newer versions (XS 6.1+) of this plug script also call a setup-vif-rules script which creates several entries in the OpenFlow table (list grabbed from the code comments):
      • Allow DHCP traffic (outgoing UDP on port 67)
      • Filter ARP requests
      • Filter ARP responses
      • Allow traffic from specified ipv4 addresses
      • Neighbour solicitation
      • Neighbour advertisement
      • Allow traffic from specified ipv6 addresses
      • Drop all other neighbour discovery
      • Drop other specific ICMPv6 types
      • Router advertisement
      • Redirect gateway
      • Mobile prefix solicitation
      • Mobile prefix advertisement
      • Multicast router advertisement
      • Multicast router solicitation
      • Multicast router termination
      • Drop everything else
  5. Creation of the port on the virtual switch also adds entries into OVSDB, the database which backs Open vSwitch.
  6. ovs-xapi-sync, which starts on XenAPI/Open vSwitch startup, keeps a local copy of the system’s state in memory. It watches for changes in the Bridge/Interface tables and pulls XenServer-specific data into other columns in those tables.
  7. On many events within OVSDB, including creates/updates of the tables touched by these operations, the OVS controller is notified via JSON-RPC. Thanks to Scott Lowe for clarification on this part.

After all of that happens, the VM boots and the guest OS sets up its network stack.
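
You can inspect the result of all this wiring on the hypervisor with the standard OVS tooling; a hedged example, assuming a bridge named xenbr0 and a vif named vif1.0 (names vary per host and VM):

ovs-vsctl list-ports xenbr0      # ports the vif plug script created
ovs-vsctl list interface vif1.0  # OVSDB columns, including those filled in by ovs-xapi-sync
ovs-ofctl dump-flows xenbr0      # OpenFlow entries, including any from setup-vif-rules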

Measuring Virtual Networking Overhead

After the ovs-discuss thread “[ovs-discuss] ovs performance on ‘worst case scenario’ with ovs-vswitchd up to 100%”, one of my colleagues had a good idea: tcpdump the physical interface and the vif at the same time. The difference between when a packet reaches the vif and when it reaches the physical device helps measure the time spent in userspace->kernelspace transit. Of course, virtual switches aren’t the only culprit in virtual networking overhead; virtual networking is a very complex topic.
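
The manual version of that idea looks something like this — a sketch with hypothetical device names and a simple ICMP filter; NetWeaver automates the same steps:

# Capture the same traffic on the guest's vif and on the physical NIC, then
# compare per-packet timestamps between the two captures.
tcpdump -i vif5.0 -w /tmp/vif.pcap icmp &
tcpdump -i eth1 -w /tmp/eth.pcap icmp &
# ...generate traffic from the source instance, then stop both captures...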

I created a new tool to help measure this overhead for certain traffic patterns: netweaver. There’s lots of info in the README, so head on by!

NetWeaver does the following:

  • Retrieve the vif details from the hypervisor
  • Start a traffic generating command on source instance(s)
  • Gather packet capture from destination instance’s hypervisor
  • Analyze the packet captures from the vif and eth devices
  • Perform some basic statistical analysis (average, max, min, stdev) on the result set

I intend to use this to analyze various configurations of Xen, guest OSes, and Open vSwitch.