Remediation as a Service

I’ve seen a couple of automated remediation tools get some coverage lately:

And both have received interesting threads on Hacker News.

One HN comment that stood out (from bigdubs):

I don’t like being the voice of the purist in this, but this seems like a bandaid on a bullet wound.

For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.

The system’s operators have to keep the system up and running. If the bug can’t be diagnosed in a sane amount of time but there’s a clear remediation, the right choice is to get the system up and running.

A lot of operations teams employ this method: If Time to Diagnose > Time to Resolve Then Workaround. This keeps the site up and running, but covering a system in bandaids will lead to more bandaids and their complexity will increase.

The commenter has a point – if everything’s band-aided, the system’s behavior is unpredictable. Another downside of automated remediation is that it can widen the divide between developers and operators. With sufficient automated remediation, in many cases operators don’t have to involve developers for fixes. This is bad.

Good reasons to do automated remediation:

  • Not all failures can be attributed to and fixed with software (broken NIC, hard drive, CPU fan)
  • Most companies do not control their technology stack end to end; doing so would also mean shipping your own device drivers or a modified Linux kernel
  • For the technology stack that is under a company’s control, the time to deploy a bugfix is greater than the time to deploy an understood temporary workaround

There are opportunities for improving operations (and operators’ lives) by employing some sort of automated remediation solution, but it is not a panacea. Some tips for finding a sane balance with automated remediation:

  • Associate issues from an issue tracker and quantify the number of times each issue’s workaround is executed – no issue tracker, no automated remediation
  • Give the in-house and vendor software related workarounds a TTL
  • Mark the workarounds as diagnosed/undiagnosed, and invest effort in diagnosing the root cause of the workaround
  • Make gathering diagnostics the first part of the workaround
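The tips above could be captured in a small registry. This is only a minimal sketch, not taken from any existing tool: the Workaround class, its field names, and the OPS-123 style issue key are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Workaround:
    """Hypothetical record tying an automated workaround to a tracked issue."""
    issue_id: str                 # issue tracker key; required by design
    expires_at: float             # TTL expressed as a Unix timestamp
    diagnosed: bool = False       # has the root cause been found yet?
    run_count: int = 0            # how many times the workaround has fired
    diagnostics: List[str] = field(default_factory=list)

    def run(self, gather: Callable[[], str], fix: Callable[[], None],
            now: Optional[float] = None) -> bool:
        """Gather diagnostics first, then apply the fix, unless the TTL expired."""
        now = time.time() if now is None else now
        if not self.issue_id:
            raise ValueError("no issue tracker entry, no automated remediation")
        if now >= self.expires_at:
            return False          # expired: a human has to look at it again
        self.diagnostics.append(gather())
        fix()
        self.run_count += 1
        return True
```

Counting executions per issue makes it easy to see which undiagnosed workarounds fire most often, which is a reasonable way to decide where to spend diagnosis effort.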

Finally – consider making investments in technology which enables useful diagnostic data to be gathered quickly and non-disruptively. One example of this is OpenStack Nova’s Guru Meditation reports.

Operating OpenStack: Monitoring RabbitMQ

At the OpenStack Operators meetup the question was asked about monitoring issues that are related to RabbitMQ.  Lots of OpenStack components use a message broker and the most commonly used one among operators is RabbitMQ. For this post I’m going to concentrate on Nova and a couple of scenarios I’ve seen in production.

[Diagram: Screen Shot 2014-08-28 at 11.27.55]

It’s important to understand the flow of messages amongst the various components and break things down into a couple of categories:

  • Services which publish messages to queues (arrow pointing toward the queue in the diagram)
  • Services which consume messages from queues (arrow pointing out from the queue in the diagram)

It’s also good to understand what actually happens when a message is consumed. In most cases, the consumer of the queue is writing to a database.

For example, for an instance reboot, nova-api publishes a message to a compute node’s queue. The compute service running on that node polls for messages, receives the reboot request, sends it to the virtualization layer, and updates the instance’s state to rebooting.

There are a few scenarios in which queue-related issues manifest:

  1. Everything’s broken – easy enough, rebuild or repair the RabbitMQ server. This post does not focus on this scenario because there is a considerable amount of material around hardening RabbitMQ in the OpenStack documentation.
  2. Everything is slow and getting slower – this often points to a queue being published to at a greater rate than it can be consumed. This scenario is more nuanced, and requires an operator to know a couple of things: which queues are shared among many services, and what the publish/consume rates are during normal operations.
  3. Some things are slow/not happening – some instance reboot requests go through, some do not. Generally speaking these operations are ‘last mile’ operations that involve a change on the instance itself. This scenario is generally restricted to a single compute node, or possibly a cabinet of compute nodes.

Baselines of RabbitMQ queue size and consumption rate are very valuable in scenarios 2 and 3, because they give you normal operations to compare against. Without a baseline, it’s difficult to know if the behavior is outside normal operating conditions.
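A baseline comparison can be as simple as the sketch below. It assumes you already have a list of queue depth samples from normal operations (for example, pulled from Graphite); the function name and the tolerance of three standard deviations are my own placeholders, not a recommendation.

```python
from statistics import mean, pstdev


def is_abnormal(baseline_depths, current_depth, tolerance=3.0):
    """Return True when current_depth is more than `tolerance` standard
    deviations above the mean of the baseline samples."""
    avg = mean(baseline_depths)
    spread = pstdev(baseline_depths)
    # Guard against a perfectly flat baseline, where the spread is zero.
    threshold = avg + tolerance * max(spread, 1.0)
    return current_depth > threshold
```

The same comparison works for consumption rates; only the samples change.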

There are a couple of tools that can help you out:

  • Diamond RabbitMQ collector (code, docs)- Send useful metrics from RabbitMQ to graphite, requires the RabbitMQ management plugin
  • RabbitMQ HTTP API – This enables operators to retrieve specific queue statistics instead of a view into an entire RabbitMQ server.
  • Nagios Rabbit Compute Queues – This is a script used with Nagios to check specified compute queues, which helps determine if operations to a specific compute node may get stuck. This helps with what I referred to earlier as scenario 3. Usually a bounce of the nova-compute service resolves these. The script looks for a local config file which allows access to the RabbitMQ management plugin. An example config file is in the gist.
  • For very real time/granular insight, run the following command on the RabbitMQ server:
    •   watch -n 0.5 'rabbitmqctl -p nova list_queues | sort -rnk2 | head'
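If you would rather do the sorting in a script than in a shell pipeline, the same idea looks roughly like this. It is a sketch that assumes the default two-column output of rabbitmqctl list_queues (queue name, then message count); the function name is hypothetical.

```python
def top_queues(list_queues_output, limit=10):
    """Return the deepest queues as (name, messages) pairs, largest first."""
    rows = []
    for line in list_queues_output.splitlines():
        parts = line.split()
        # Skip banner lines such as "Listing queues ..." that are not
        # "<name> <count>" pairs.
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        rows.append((parts[0], int(parts[1])))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]
```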

Here is an example chart that can be produced with the RabbitMQ diamond collector which can be integrated into an operations dashboard:

[Chart: Screen Shot 2014-08-28 at 11.19.18]

Baseline monitoring of the RabbitMQ servers themselves isn’t enough. I recommend an approach that combines the following:

  • Using the RabbitMQ management plugin (required)
  • Nagios checks on specific queues (optional)
  • Diamond RabbitMQ collector to send data to Graphite
  • Dashboard combining statistics from the RabbitMQ installations
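For the Nagios piece, a per-queue check boils down to mapping queue statistics onto the standard Nagios plugin exit codes. The sketch below is not the script from the gist; it assumes a dict shaped like a queue object from the RabbitMQ management API (with messages and consumers fields), and the thresholds are made up.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_queue(stats, warn=10, crit=50):
    """Map one queue's stats to an (exit_code, message) pair."""
    depth = stats.get("messages")
    consumers = stats.get("consumers")
    if depth is None or consumers is None:
        return UNKNOWN, "UNKNOWN: queue stats missing"
    if consumers == 0:
        # Nothing is listening, so messages can only pile up.
        return CRITICAL, "CRITICAL: no consumers on queue"
    if depth >= crit:
        return CRITICAL, "CRITICAL: %d messages queued" % depth
    if depth >= warn:
        return WARNING, "WARNING: %d messages queued" % depth
    return OK, "OK: %d messages queued" % depth
```

Treating a queue with zero consumers as critical is a design choice here: for a compute node’s queue it usually means the nova-compute service is not attached at all.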

Managing Nagios Configurations

There’s a good talk given by Gabe Westmaas at the HK OpenStack Summit:

The talk describes what Rackspace monitors in the public cloud OpenStack deployment, how responses are handled, and some of the integration points that are used.  I recommend watching it for OpenStack specific monitoring and a little context around this post.

In this post I am going to discuss how the sausage gets made – how the underlying Nagios configuration is managed.

Some background: We have 3 classes of Nagios servers.

  1. Global – monitors global control plane nodes (e.g., glance-api, nova-api, nova-cells, cell nagios)
  2. Cell – monitors cell control plane nodes, and individual clusters of data plane nodes (e.g., compute nodes/hypervisors)
  3. Mixed – smaller environments – these are a combined cell/global

With Puppet, the Nagios node’s class is based on hostname, then the Nagios install/config puppet module is applied.

The Nagios puppet setup is pretty simple. It performs basic installation and configuration of Nagios along with pulling in a git repository of Nagios config files. The puppet modules/manifests change rarely, but the Nagios configuration itself has to change relatively frequently.

Types of changes to the Nagios configuration:

  1. Systems Lifecycle – normal bulk add/remove of service/host definitions. These are generated with some automation, currently a combination of Ansible and Python scripts which reach into other inventory systems.
  2. Gap Filling – as a result of RCAs or other efforts, gaps in the current monitoring configuration are identified. After the gap is identified, we need to ensure it is fully remediated in all existing datacenters and all new spin ups.
  3. Cosmetics/Tweaking – we perform analytics on our monitoring to prioritize/identify opportunities to automate remediation and/or deep dive into root causes. We have a logster parser running on each Nagios node which sends what/when/where on alerts to StatsD/Graphite. Toward the analytics effort, we sometimes make changes to give all services more machine-readable names. We also tune monitoring thresholds for services that are too chatty or not chatty enough.
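To show why machine-readable names matter for the analytics, here is a sketch of turning one alert into a StatsD counter line. It is not our logster parser; the nagios.alerts prefix, the function name, and the sample host are invented for illustration. The only fixed part is the StatsD counter format, bucket:value|c.

```python
import re


def alert_metric(environment, host, service):
    """Build a StatsD counter line for one alert (the what and the where;
    the when comes from the time the metric is received)."""
    def clean(part):
        # Lowercase and collapse anything outside [a-z0-9_-] so each part
        # stays a single Graphite path segment.
        return re.sub(r"[^a-z0-9_-]+", "_", part.lower()).strip("_")

    bucket = ".".join(
        ["nagios", "alerts", clean(environment), clean(host), clean(service)]
    )
    return "%s:1|c" % bucket
```

A service description with spaces or punctuation needs this kind of cleanup before it can be grouped in Graphite, which is why renaming services up front pays off.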

Changes #2 and #3 were the drivers for putting the Nagios configuration files into a single repository. Without a single repository, the en masse changes were cumbersome and didn’t get made. The configuration repository is laid out like this:

  • Shared configurations are stored in a common folder, which has a corresponding subfolder for each Nagios node class.
  • Service/Host definitions are stored in folders relative to their environments
  • All datacenters/environments are stored within the environments folder

The entire repository is cloned onto the Nagios node, and parts of it are copied and/or symlinked into /etc/nagios3/conf.d/ based on the Nagios node class and the environment.

For example:

  • nagios class is cell (c0001 in the hostname), environment is test/c0001
  • /etc/nagios3/conf.d/ gets cfg files from the common/cell folder in the config repo
  • environments/test/c0001 is symlinked to /etc/nagios3/conf.d/c0001/
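The example above can be expressed as a small planning function. This is a sketch rather than our Puppet code: the repo_root path is a made-up clone location, and the function only computes what would be copied and linked instead of touching the filesystem.

```python
import posixpath


def plan_config(node_class, environment,
                repo_root="/opt/nagios-config",   # hypothetical clone location
                conf_d="/etc/nagios3/conf.d"):
    """Return (copy_from, links) for one Nagios node.

    copy_from: repo folder whose cfg files land directly in conf.d
    links:     list of (source, link_name) symlinks to create
    """
    copy_from = posixpath.join(repo_root, "common", node_class)
    env_source = posixpath.join(repo_root, "environments", environment)
    # The link is named after the last path component, e.g. "c0001".
    link_name = posixpath.join(conf_d, posixpath.basename(environment))
    return copy_from, [(env_source, link_name)]
```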

This setup has been working well for us in production. It’s enabling first responders and engineers to make more meaningful changes faster to the monitoring stack at Rackspace.

Using Swift and logrotate

Ever have an exchange like this?

Q: What happened <insert very very long time ago> on this service?
A: We can’t keep logs on the server past 2 months.  Those logs are gone.

Just about every IaaS out there has an object store. Amazon offers S3 and OpenStack providers have Swift. Why not just point logrotate at one of those object stores?

That’s just what I’ve done with Swiftrotate. It’s a simple shell script to use with logrotate. Config samples and more are in the project’s README.

NOTE: It doesn’t make a lot of sense to use without using dateext in logrotate. A lot of setups don’t use dateext, so there’s a utility script to rename all of your files to a dateext format.
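To illustrate what renaming to a dateext format involves, here is a sketch in Python. It is not the utility script that ships with Swiftrotate; it assumes logrotate’s default dateformat (-%Y%m%d), numbered rotations like app.log.3.gz, and that you supply the date to stamp (for example, derived from the file’s modification time).

```python
import re
from datetime import date


def dateext_name(filename, rotated_on):
    """Map 'app.log.3.gz' to 'app.log-YYYYMMDD.gz'.

    Returns None when the name has no numeric rotation suffix.
    """
    match = re.match(r"^(?P<base>.+)\.(?P<num>\d+)(?P<ext>\.gz)?$", filename)
    if match is None:
        return None
    stamp = rotated_on.strftime("%Y%m%d")
    return "%s-%s%s" % (match.group("base"), stamp, match.group("ext") or "")


example = dateext_name("app.log.3.gz", date(2014, 8, 28))  # "app.log-20140828.gz"
```

With date-stamped names, every rotated file has a unique, stable object name, which is what makes pushing them to an object store sensible.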

Deep Dive: OpenStack Nova Snapshot Image Creation with XenAPI/XenServer and Glance

Based on currently available code (nova: a77c0c50166aac04f0707af25946557fbd43ad44 2012-11-02/python-glanceclient: 16aafa728e4b8309b16bcc120b10bc20372883f4 2012-11-07/glance: 9dae32d60fc285d03fdb5586e3368d229485fdb4)

This is a deep dive into what happens (and where in the code) during image creation with a Nova/Glance configuration that is backed by XenServer/XenAPI. Hopefully the methods used in Glance/Nova’s code will not change over time, and this guide will remain a good starting point.

Disclaimer: I am _not_ a developer, and these are really just best guesses. Corrections are welcome.


1) nova-api receives an imaging request. The request is validated, checking for a name and making sure the request is within quotas. Instance data is retrieved, as well as block device mappings. If the instance is volume backed, a separate compute API call is made to snapshot (self.compute_api.snapshot_volume_backed). For this deep dive, we’ll assume there is no block device mapping. self.compute_api.snapshot is called. The newly created image UUID is returned.

  • nova/api/openstack/compute/
  • def _action_create_image

2) The compute API gets the request and calls _create_image. The instance’s task state is set to IMAGE_SNAPSHOT. Notifications are created of the state change. Several properties are collected about the image, including the minimum RAM, customer, and base image ref. The non-inheritable instance system metadata is also collected. (2a, 2b, 2c) self.image_service.create and (3) self.compute_rpcapi.snapshot_instance are called.

  • nova/compute/
  • def snapshot
  • def _create_image

2a) The collected metadata from 2 is put into a glance-friendly format, and sent to glance. The glance client’s create is called.

  • nova/image/
  • def create

2b) Glance (client) sends a POST to the glance server at /v1/images with the image metadata gathered in (2).

  • glanceclient/v1/
  • def create

2c) Glance (server) receives the POST. Per the code comments:

Upon a successful save of the image data and metadata, a response
containing metadata about the image is returned, including its
opaque identifier.

  • glance/api/v1
  • def create
  • def _handle_source

3) Compute RPC API casts a message to the queue for the instance’s compute node.

  • nova/compute/
  • def snapshot_instance

4) The instance’s power state is read and updated. (4a) The XenAPI driver’s snapshot() is called. Notification is created for the snapshot’s start and end.

  • nova/compute/
  • def snapshot_instance

4a) The vmops snapshot is called (4a1).

  • nova/virt/xenapi/
  • def snapshot

4a1) The snapshot is created in XenServer via (4a1i) vm_utils, and (4a1ii) uploaded to glance. The code’s comments say this:

Steps involved in a XenServer snapshot:

1. XAPI-Snapshot: Snapshotting the instance using XenAPI. This
creates: Snapshot (Template) VM, Snapshot VBD, Snapshot VDI,
Snapshot VHD
2. Wait-for-coalesce: The Snapshot VDI and Instance VDI both point to
a ‘base-copy’ VDI. The base_copy is immutable and may be chained
with other base_copies. If chained, the base_copies
coalesce together, so, we must wait for this coalescing to occur to
get a stable representation of the data on disk.
3. Push-to-glance: Once coalesced, we call a plugin on the XenServer
that will bundle the VHDs together and then push the bundle into

  • nova/virt/xenapi/
  • def snapshot

4a1i) The instance’s root disk is recorded and its VHD parent is also recorded. The SR is recorded. The instance’s root VDI is snapshotted. Operations are blocked until a coalesce completes in _wait_for_vhd_coalesce (4a1i-1).

  • nova/virt/xenapi/
  • def snapshot_attached_here

4a1i-1) The end result of this process is outlined in the code comments:

Before coalesce:

* original_parent_vhd
    * parent_vhd

After coalesce:

* parent_vhd

In (4a1i) the original VDI UUID was recorded. In a nutshell, the code is ensuring that the desired layout above is met before allowing the snapshot to continue. The code polls up to CONF.xenapi_vhd_coalesce_max_attempts times, sleeping CONF.xenapi_vhd_coalesce_poll_interval between attempts. On each attempt the SR is scanned and the original_parent_uuid is compared to the parent_uuid; if they don’t match, we wait a while and check again for the coalescing to complete.

  • nova/virt/xenapi/
  • def _wait_for_vhd_coalesce
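The polling described in 4a1i-1 follows a common pattern, sketched below. This is not Nova’s code: scan_sr and get_parent_uuid are hypothetical callables standing in for the XenAPI calls, and the default attempt count and interval are placeholders for the two CONF options.

```python
import time


def wait_for_coalesce(scan_sr, get_parent_uuid, original_parent_uuid,
                      max_attempts=5, poll_interval=5.0, sleep=time.sleep):
    """Poll until the VHD's parent matches original_parent_uuid.

    Raises TimeoutError if the chain has not settled after max_attempts.
    """
    for _ in range(max_attempts):
        scan_sr()                          # make the SR notice VHD changes
        parent_uuid = get_parent_uuid()
        if parent_uuid == original_parent_uuid:
            return parent_uuid             # coalesce finished
        sleep(poll_interval)               # not yet; wait and check again
    raise TimeoutError(
        "VHD coalesce did not complete in %d attempts" % max_attempts
    )
```

Bounding the number of attempts matters operationally: a coalesce that never completes would otherwise leave the snapshot request hanging forever.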

4a1ii) The glance API servers are retrieved from configuration. The glance upload_vhd XenAPI plugin is called.

  • nova/virt/xenapi/
  • def upload_image

4a2) A staging area is created, prepared, and _upload_tarball is called.

  • plugins/xenserver/xenapi/etc/xapi.d/plugins/glance
  • def upload_vhd

4a3) The staging area is prepared. This basically symlinks the snapshot VHDs to a temporary folder in the SR.

  • plugins/xenserver/xenapi/etc/xapi.d/plugins/
  • def prepare_staging_area

4a4) The comments say it best:

Create a tarball of the image and then stream that into Glance
using chunked-transfer-encoded HTTP.

A URL is constructed and a connection is opened to it. The image meta properties (like status) are collected and added as HTTP headers. The tarball is created, and streamed to glance in CHUNK_SIZE increments.  The HTTP stream is terminated, the connection checks for an OK from glance and reports accordingly.

  • plugins/xenserver/xenapi/etc/xapi.d/plugins/glance
  • def _upload_tarball
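For readers unfamiliar with chunked transfer encoding, the framing used when streaming a body of unknown length looks like the sketch below. It is not the plugin’s code; the function name and the chunk size are placeholders, and a real upload would write each frame to the open HTTP connection.

```python
import io


def chunked_frames(stream, chunk_size=8192):
    """Yield HTTP/1.1 chunked-transfer frames for the bytes in stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Each frame: payload size in hex, CRLF, payload, CRLF.
        yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    # A zero-length chunk terminates the body.
    yield b"0\r\n\r\n"


# Usage: 20 bytes framed in 16-byte chunks gives two data frames plus the terminator.
frames = list(chunked_frames(io.BytesIO(b"a" * 20), chunk_size=16))
```

Because the size of each chunk is sent inline, the sender never needs to know the tarball’s total size up front, which is why the tarball can be created and streamed at the same time.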

(Glance Server)

5) I’ve removed some of the obvious URL routing functions in glance to get down to the meat of this process. Basically, the PUT request goes to the glance API. The API interacts with the registry again, but this time there is data to be uploaded. The image’s metadata is validated for activation, and then _upload_and_activate is called. _upload_and_activate is basically a way to call _upload and, if it succeeds, activate the image. _upload checks to see if we’re copying, but we’re not. It also checks to see if the HTTP request is application/octet-stream. Then, an object store like swift is inferred from the request or taken from the glance configuration (self.get_store_or_400). Finally, the image is added to the object store, its checksum is verified, and the glance registry is updated. Notifications are also sent for image.upload.

  • glance/api/v1/
  • def update
  • def _handle_source
  • def _upload_and_activate
  • def _upload

Deep Dive: Openstack Nova Rescue Mode with XenAPI / XenServer

Based on currently available code (a77c0c50166aac04f0707af25946557fbd43ad44 2012-11-02)

This is a deep dive into what happens (and where in the code) during a rescue/unrescue scenario with a Nova configuration that is backed by XenServer/XenAPI. Hopefully the methods used in Nova’s code will not change over time, and this guide will remain a good starting point.


1) nova-api receives a rescue request. A new admin password is generated via utils.generate_password meeting FLAGS.password_length length requirement. The API calls rescue on the compute api.

  • nova/api/openstack/compute/contrib/
  • def _rescue

2) The compute API updates the vm_state to RESCUING, and calls the compute rpcapi rescue_instance with the same details.

  • nova/compute/
  • def rescue

3) The RPC API casts a rescue_instance message to the compute node’s message queue.

  • nova/compute/
  • def rescue_instance

4) nova-compute consumes the message in the queue containing the rescue request. The admin password is retrieved; if one was not passed this far, one is generated via utils.generate_password with the same flags as step 1. It then records details about the source instance, like networking and image details. The compute driver rescue function is called. After that (4a-4c) completes, the instance’s vm_state is updated to rescued.

  • nova/compute/
  • def rescue_instance

4a) This abstraction was skipped over in the last two deepdives, but for the sake of completeness: Driver.rescue is called. This just calls _vmops.rescue, where the real work happens.

  • nova/virt/xenapi/
  • def rescue

4b) Checks are performed to ensure the instance isn’t in rescue mode already. The original instance is shut down via XenAPI. The original instance is bootlocked. A new instance is spawned with -rescue in the name-label.

  • nova/virt/xenapi/
  • def rescue

4c) A new VM is created just as all other VMs, with the source VM’s metadata. The root volume from the instance we are rescuing is attached as a secondary disk. The instance’s networking is the same, however the new hostname is RESCUE-hostname.

  • nova/virt/xenapi/
  • def spawn -> attach_disks_step rescue condition


1) nova-api receives an unrescue request.

  • nova/api/openstack/compute/contrib/
  • def _unrescue

2) The compute API updates the vm_state to UNRESCUING, and calls the compute rpcapi unrescue_instance with the same details.

  • nova/compute/
  • def unrescue

3) The RPC API casts an unrescue_instance message to the compute node’s message queue.

  • nova/compute/
  • def unrescue_instance

4) The compute manager receives the unrescue_instance message and calls the driver’s unrescue method.

  • nova/compute/
  • def unrescue_instance

4a) Driver.unrescue is called. This just calls _vmops.unrescue, where the real work happens.

  • nova/virt/xenapi/
  • def unrescue

4b) The rescue VM is found. Checks are done to ensure the VM is in rescue mode. The original VM is found. The rescue instance has _destroy_rescue_instance performed (4b1). After that completes, the source VM’s bootlock is released and the VM is started.

  • nova/virt/xenapi/
  • def unrescue

4b1) A hard shutdown is issued on the rescue instance. Via XenAPI, the root disk of the original instance is found. All VDIs attached to the rescue instance are destroyed, except the root disk of the original instance. The rescue VM is destroyed.

  • nova/virt/xenapi/
  • def _destroy_rescue_instance
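The important detail in 4b1 is the exclusion: the original instance’s root disk is attached to the rescue VM, so it has to be filtered out before anything is destroyed. A sketch of just that filter, with a hypothetical function name and plain UUID strings in place of XenAPI references:

```python
def vdis_to_destroy(attached_vdi_uuids, original_root_uuid):
    """Return the VDIs attached to the rescue VM that are safe to destroy,
    i.e. everything except the original instance's root disk."""
    return [uuid for uuid in attached_vdi_uuids if uuid != original_root_uuid]
```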


Deep dive: OpenStack Nova Resize Down with XenAPI/Xenserver

Based on the currently available code (commit 114109dbf4094ae6b6333d41c84bebf6f85c4e48 – 2012-09-13)

This is a deep dive into what happens (and where in the code) during a resize down (e.g., flavor 4 to flavor 2) with a Nova configuration that is backed by XenServer/XenAPI. Hopefully the methods used in Nova’s code will not change over time, and this guide will remain a good starting point.

Steps 1-6a are identical to my previous entry “Deep dive: OpenStack Nova Resize Up with XenAPI/Xenserver”. This deep dive will examine the divergence between resize up and resize down in Nova, as there are a few key differences.

6b) The instance resize progress gets an update. The VM is shut down via XenAPI.

  • ./nova/virt/xenapi/
  • def _migrate_disk_resizing_down

6c) The source VDI is copied on the hypervisor via XenAPI VDI.copy. Then a new, separate VDI is created along with a VBD, which is plugged into the compute node. The partition and filesystem of the new disk are resized via _resize_part_and_fs, using e2fsck, tune2fs, parted, and tune2fs. The source VDI copy is also attached to nova-compute. Via _sparse_copy (sparse copying is configurable, and enabled by default), nova-compute temporarily takes ownership of both devices (source read, destination write) and performs a block-level copy, omitting zeroed blocks.

  • ./nova/virt/xenapi/
  • def _resize_part_and_fs
  • def _sparse_copy
  • nova/
  • def temporary_chown
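The idea behind a sparse block-level copy is sketched below. This is not Nova’s _sparse_copy: the function name and block size are placeholders, and the sketch assumes the destination is already zero-filled and at least as large as the source (as a freshly created disk would be), so skipped blocks can simply be seeked over.

```python
import io


def sparse_copy(src, dst, block_size=4096):
    """Copy src to dst block by block, seeking past all-zero blocks.

    Returns (bytes_written, bytes_skipped).
    """
    zero = bytes(block_size)
    written = skipped = 0
    while True:
        block = src.read(block_size)
        if not block:
            break
        if block == zero[:len(block)]:
            # Nothing but zeroes: advance the destination without writing.
            dst.seek(len(block), io.SEEK_CUR)
            skipped += len(block)
        else:
            dst.write(block)
            written += len(block)
    return written, skipped
```

On a mostly empty disk, skipping zeroed blocks avoids the bulk of the writes, which is where the time savings on a resize down come from.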

6d) Progress is again updated. The devices that were attached are unplugged, and the VHDs are copied in the same fashion as outlined in steps 6a1i-6b2 of the deep dive on resizing up, aside from 6b2.

Deep Dive: OpenStack Nova Resize Up with XenAPI/Xenserver

Nova is the Compute engine of the OpenStack project.

Based on the currently available code (commit 114109dbf4094ae6b6333d41c84bebf6f85c4e48 – 2012-09-13)

This is a deep dive into what happens (and where in the code) during a resize up (e.g., flavor 2 to flavor 4) with a Nova configuration that is backed by XenServer/XenAPI. Hopefully the methods used in Nova’s code will not change over time, and this guide will remain a good starting point.

Some abstractions such as go-between RPC calls and basic XenAPI calls have been deliberately ignored.

Disclaimer: I am _not_ a developer, and this is just my best guess through an overly-caffeinated code dive. Corrections are welcome.

1) API receives a resize request.

2) Request Validations performed.

3) Quota verifications are performed.

  • ./nova/compute/
  • def resize

4) Scheduler is notified to prepare the resize request. A target is selected for the resize and notified.

  • ./nova/scheduler/
  • def schedule_prep_resize

5) Usage notifications are sent as the resize is preparing. A migration entry is created in the nova database with the status pre-migrating.  resize_instance (6) is fired.

  • ./nova/compute/
  • def prep_resize

6) The migration record is updated to migrating. The instance’s task_state is updated from resize preparation to resize migrating. Usages receive notification that the resize has started. The instance is renamed on the source to have a suffix of -orig. (6a) migrate_disk_and_power_off is invoked. The migration record is updated to post migrating. The instance’s task_state is updated to resize migrated. (6b) Finish resize is called.

  • ./nova/compute/
  • def resize_instance
  • def migrate_disk_and_power_off
  • ./nova/virt/xenapi/

6a) migrate_disk_and_power_off, where the work begins… Progress is zeroed out. A resize up or down is detected. We’re taking the resize up code path in 6a1.

  • ./nova/virt/xenapi/ -> ./nova/virt/xenapi/
  • def migrate_disk_and_power_off

6a1) Snapshot the instance. Via migrate_vhd, transfer the immutable VHDs (the base copy, or parent VHD, belonging to the instance). Instance resize progress is updated. Power down the instance. Again via migrate_vhd (steps 6a1i-6a1v), transfer the COW VHD, which holds the changes that have occurred since the snapshot was taken.

  • ./nova/virt/xenapi/
  • def _migrate_disk_resizing_up

6a1i) Call the XenAPI plugin on the hypervisor to transfer_vhd

  • ./nova/virt/xenapi/
  • def _migrate_vhd

6a1ii) Make a staging area on the source server to prepare the VHD transfer.

  • ./plugins/xenserver/xenapi/etc/xapi.d/plugins/migration
  • def transfer_vhd

6a1iii) Hard link the VHD(s) transferred to the staging area.

  • ./plugins/xenserver/xenapi/etc/xapi.d/plugins/
  • def prepare_staging_area

6a1iv) rsync the VHDs to the destination. The destination path is /images/instance-instance_uuid.

  • ./plugins/xenserver/xenapi/etc/xapi.d/plugins/migration
  • def _rsync_vhds

6a1v) Clean up the staging area which was created in 6a1ii.

  • ./plugins/xenserver/xenapi/etc/xapi.d/plugins/
  • def cleanup_staging_area

6b) (6b1) Set up the newly transferred disk and turn on the instance on the destination host. Make the tenant’s quota usage reflect the newly resized instance.

  • ./nova/compute/
  • def finish_resize

6b1) The instance record is updated with the new instance_type (RAM, CPU, disk, etc). Networking is set up on the destination (a topic for another day; my best guess is that nova-network, quantum, and the configured quantum plugin(s) are notified of the networking changes). The instance’s task_state is set to resize finish. Usages are notified of the beginning of the end of the resize process. (6b1i) Finish migration is invoked. The instance record is updated to resized. The migration record is set to finished. Usages are notified that the resize has completed.

  • ./nova/compute/
  • def _finish_resize

6b1i) (6b1ii) move_disks is called. (6b2) _resize_instance is called. The destination VM is created and started via XenAPI. The resize’s progress is updated.

  • ./nova/virt/xenapi/ -> ./nova/virt/xenapi/
  •  def finish_migration

6b1ii) (6b1iii) The XenAPI plugin is called to move_vhds_into_sr. The SR is scanned. The Root VDI’s name-label and name-description are set to reflect the instance’s details.

  • ./nova/virt/xenapi/
  • def move_disks

6b1iii) Remember the VHD destination from step 6a1iv? I thought so! =) (6b1iv) Call import_vhds on the VHDs in that destination. Clean up this staging area, just like 6a1v.

  • ./plugins/xenserver/xenapi/etc/xapi.d/plugins/migration
  • def move_vhds_into_sr

6b1iv) Move the VHDs from the staging area to the SR. Staging areas are used because if the SR is unaware of VHDs contained within, it could delete our data. This function assembles our VHD chain’s order, and assigns UUIDs to them. Then they are renamed from #.vhd to UUID.vhd. The VHDs are then linked to each other, starting from parent and moving down the tree. This is done via calls to vhd-util modify. The VDI chain is validated. Swap files are given a similar treatment, and appended to the list of files to move into the SR. All of these files are renamed (python os.rename) from the staging area into the host’s SR.

  • ./plugins/xenserver/xenapi/etc/xapi.d/plugins/
  • def import_vhds
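The rename-and-relink bookkeeping in 6b1iv can be sketched as a planning step. This is not the plugin’s code, and it rests on an assumption I have not verified against the source here: that 0.vhd is the leaf and higher numbers are its ancestors. The function name is hypothetical, and the actual relinking would be done by vhd-util modify.

```python
import uuid


def plan_import(staged_names, new_uuid=lambda: str(uuid.uuid4())):
    """Return (renames, links) for numbered VHDs such as ['0.vhd', '1.vhd'].

    ASSUMPTION: 0.vhd is the leaf and higher numbers are its ancestors.
    renames: list of (old_name, new_name)
    links:   list of (child, parent), ordered from the base copy downward
    """
    ordered = sorted(staged_names, key=lambda name: int(name.split(".")[0]))
    renames = [(name, "%s.vhd" % new_uuid()) for name in ordered]
    new_names = [new for _, new in renames]
    # Pair each VHD with its parent, then start from the top of the chain.
    links = list(reversed(list(zip(new_names, new_names[1:]))))
    return renames, links
```

Doing all of this in the staging area first means the SR only ever sees a complete, correctly linked chain.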

6b2) The instance’s root disk is resized. The current size of the root disk is retrieved from XenAPI (virtual_size). XenAPI VDI.resize_online (XenServer > 5) is called to resize the disk to its new size as defined by the instance_type.

UPDATE: (2012-10-28) Added to step 6 the renaming to the instance-uuid-orig on the source.