Graphite Events with a Timestamp

There are a few good posts out there about Graphite events and how and why to use them.

Earlier I was trying to add events to Graphite but ran into an issue: my events used a timestamp in the past. The examples I found only showed publishing events with a ‘now’ timestamp.

I went digging into the part of Graphite that adds events, and it turns out the functionality already exists.

Just add a ‘when’ to your payload with a Unix timestamp.

For example:

curl -X POST "http://graphite/events/" \
    -d '{"what": "Event - deploy", "tags": "deploy",
         "when": 1418064254.340719}'

This functionality was in the original Graphite events release.
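
To sanity-check that a backdated event actually landed, you can read events back over HTTP. This is just a quick sketch; the /events/get_data endpoint and its tags/from/until parameters are graphite-web's events API as I recall it, so adjust for your install:

# Read recent 'deploy' events back as JSON and confirm the backdated timestamp
curl "http://graphite/events/get_data?tags=deploy&from=-7d&until=now"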

nsxchecker: Verify the health of your NSX network


Recently I got to work with the NSX API and write a tool to do a quick health check of NSX networks.

nsxchecker is a valuable operational tool to quickly report an NSX network’s health. One of the promises of SDN is automated tooling for operational teams, and with the NSX API I was able to deliver quickly.


nsxchecker accepts an NSX lswitch UUID or a neutron_net_id. Rackspace’s Neutron plugin, quark, tags created lports with a neutron_net_id. nsxchecker requires administrative access to the NSX controllers.

Neutron itself supports probes, but they have a couple of drawbacks:

  1. It doesn’t work with all implementations
  2. For a large network, it’s slow

There are more details in the README on GitHub.

Operating OpenStack: Monitoring RabbitMQ

At the OpenStack Operators meetup, a question was asked about monitoring issues related to RabbitMQ. Lots of OpenStack components use a message broker, and the one most commonly used among operators is RabbitMQ. For this post I’m going to concentrate on Nova and a couple of scenarios I’ve seen in production.

[Diagram: message flow between Nova services and their RabbitMQ queues]

It’s important to understand the flow of messages amongst the various components and break things down into a couple of categories:

  • Services which publish messages to queues (arrow pointing toward the queue in the diagram)
  • Services which consume messages from queues (arrow pointing out from the queue in the diagram)

It’s also good to understand what actually happens when a message is consumed. In most cases, the consumer of the queue is writing to a database.

Take an instance reboot as an example: nova-api publishes a message to the compute node’s queue. The nova-compute service running on that node polls for messages, receives the reboot request, passes it to the virtualization layer, and updates the instance’s state to rebooting.
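
You can see both sides of this on a live broker with rabbitmqctl. A minimal sketch, assuming the nova vhost used elsewhere in this post; the queue and consumer listings show which services are keeping up:

# Queue depth and consumer counts for the nova vhost, deepest queues first
rabbitmqctl list_queues -p nova name messages consumers | sort -rnk2 | head

# Which channels (i.e. which services) are consuming each queue
rabbitmqctl list_consumers -p nova | head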

There are a few scenarios in which queue-related issues manifest:

  1. Everything’s broken – easy enough, rebuild or repair the RabbitMQ server. This post does not focus on this scenario because there is a considerable amount of material around hardening RabbitMQ in the OpenStack documentation.
  2. Everything is slow and getting slower – this often points to a queue being published to at a greater rate than it can be consumed. This scenario is more nuanced and requires an operator to know a couple of things: which queues are shared among many services, and what the publish/consume rates look like during normal operations.
  3. Some things are slow/not happening – some instance reboot requests go through, some do not. Generally speaking, these are ‘last mile’ operations that involve a change on the instance itself. This scenario is generally restricted to a single compute node, or possibly a cabinet of compute nodes.

Baselines for RabbitMQ queue size and consumption rate are very valuable in scenarios 2 and 3. Without a baseline, it’s difficult to know whether the behavior you’re seeing is outside normal operating conditions.

There are a couple of tools that can help you out:

  • Diamond RabbitMQ collector (code, docs) – Sends useful metrics from RabbitMQ to Graphite; requires the RabbitMQ management plugin.
  • RabbitMQ HTTP API – This enables operators to retrieve statistics for a specific queue rather than a view of an entire RabbitMQ server (see the example after this list).
  • Nagios Rabbit Compute Queues – This is a script used with Nagios to check specified compute queues, which helps determine whether operations to a specific compute node may get stuck. This addresses what I referred to earlier as scenario 3; usually a bounce of the nova-compute service clears these up. The script looks for a local config file that provides access to the RabbitMQ management plugin; an example config file is in the gist.
  • For very granular, near-real-time insight, run the following command on the RabbitMQ server:
    •   watch -n 0.5 'rabbitmqctl -p nova list_queues | sort -rnk2 | head'
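
Here is a hedged example of pulling a single compute queue’s depth and rates from the management API. The host, credentials, and the queue name compute.node01 are placeholders for your deployment; 15672 is the management plugin’s default port on RabbitMQ 3.x:

# Look at "messages", "message_stats.publish_details.rate" and
# "message_stats.deliver_get_details.rate" in the response
curl -s -u monitoring:password \
    "http://rabbitmq:15672/api/queues/nova/compute.node01" | python -m json.tool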

Here is an example chart produced with the RabbitMQ Diamond collector, which can be integrated into an operations dashboard:

[Chart: RabbitMQ queue metrics collected by Diamond, graphed in Graphite]

Baseline monitoring of the RabbitMQ servers themselves isn’t enough. I recommend an approach that combines the following:

  • Using the RabbitMQ management plugin (required)
  • Nagios checks on specific queues (optional)
  • Diamond RabbitMQ collector to send data to Graphite
  • A dashboard combining statistics across your RabbitMQ installations
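
For the dashboard piece, the Diamond-collected metrics can be pulled straight from Graphite’s render API. A sketch only; the metric path below is hypothetical and depends on your Diamond path prefix and collector settings:

# Queue-depth series for all nova queues on one broker, as JSON for a dashboard panel
curl -s "http://graphite/render?target=servers.rabbit01.rabbitmq.queues.nova.*.messages&from=-1h&format=json"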

Managing Nagios Configurations

There’s a good talk given by Gabe Westmaas at the Hong Kong OpenStack Summit.

The talk describes what Rackspace monitors in its public cloud OpenStack deployment, how responses are handled, and some of the integration points that are used. I recommend watching it for OpenStack-specific monitoring and for a little context around this post.

In this post I am going to discuss how the sausage gets made – how the underlying Nagios configuration is managed.

Some background: we have three classes of Nagios servers.

  1. Global – monitors global control plane nodes (e.g., glance-api, nova-api, nova-cells, cell nagios)
  2. Cell – monitors cell control plane nodes, and individual clusters of data plane nodes (e.g., compute nodes/hypervisors)
  3. Mixed – smaller environments – these are a combined cell/global

With Puppet, the Nagios node’s class is determined from its hostname, and then the Nagios install/config Puppet module is applied.

The Nagios puppet setup is pretty simple. It performs basic installation and configuration of Nagios along with pulling in a git repository of Nagios config files. The puppet modules/manifests change rarely, but the Nagios configuration itself has to change relatively frequently.

Types of changes to the Nagios configuration:

  1. Systems Lifecycle – normal bulk add/remove of service/host definitions. These are generated with some automation, currently a combination of Ansible and Python scripts which reach into other inventory systems.
  2. Gap Filling – as a result of RCAs or other efforts, gaps in the current monitoring configuration are identified. After the gap is identified, we need to ensure it is fully remediated in all existing datacenters and all new spin ups.
  3. Cosmetics/Tweaking – we perform analytics on our monitoring to prioritize/identify opportunities to automate remediation and/or deep dive into root causes. We have a logster parser running on each Nagios node which sends what/when/where on alerts to StatsD/Graphite (sketched below). Toward the analytics effort, we sometimes make changes to give all services more machine-readable names. We also tune monitoring thresholds for services that are too chatty or not chatty enough.
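
The StatsD side of that alert pipeline is just counters over UDP. Here is a sketch of the kind of metric the logster parser emits; the metric name and StatsD host are placeholders, but the name:value|c counter format on UDP port 8125 is standard StatsD:

# Emit one alert counter (placeholder metric name and host)
echo "nagios.alerts.dc1.nova-api:1|c" | nc -u -w1 statsd.example.com 8125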

Change types #2 and #3 were the drivers for putting the Nagios configuration files into a single repository. Without a single repository, en masse changes were cumbersome and didn’t get made. The configuration repository is laid out like this:

  • Shared configurations are stored in a common folder, with a subfolder for each Nagios node class.
  • Service/Host definitions are stored in folders relative to their environments
  • All datacenters/environments are stored within the environments folder

The entire repository is cloned onto the Nagios node, and parts of it are copied and/or symlinked into /etc/nagios3/conf.d/ based on the Nagios node class and the environment.

For example:

  • nagios01.c0001.test.com: nagios class is cell (c0001 in the hostname), environment is test/c0001
  • /etc/nagios3/conf.d/ gets cfg files from the common/cell folder in the config repo
  • environments/test/c0001 is symlinked to  /etc/nagios3/conf.d/c0001/
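
Roughly, the deploy step for that node looks like the sketch below. The clone path and repository URL are assumptions; the copy and symlink targets follow the example above:

# Hypothetical deploy for nagios01.c0001.test.com (cell class, test/c0001 environment)
git clone git@git.example.com:nagios-config.git /opt/nagios-config
cp /opt/nagios-config/common/cell/*.cfg /etc/nagios3/conf.d/
ln -s /opt/nagios-config/environments/test/c0001 /etc/nagios3/conf.d/c0001
# Validate the configuration before reloading
nagios3 -v /etc/nagios3/nagios.cfg && service nagios3 reload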

This setup has been working well for us in production. It’s enabling first responders and engineers to make more meaningful changes faster to the monitoring stack at Rackspace.