Book Review: The Staff Engineer’s Path

The Staff Engineer’s Path. This book discusses the “Staff+” engineering career paths and gives concrete advice for folks in those roles. If you are established or new in a Staff+ role, or struggling with the Engineering/Management pendulum, this book is for you.

What I liked: the entire book! Seriously. I enjoyed the Strategy/Vision section the most. The footnotes had great supplemental material.

What I didn’t like: The concept of drawing a map felt repetitive. I like the concept, but it started to feel overused later in the book.


A Sound(bar) Annoyance

This post is about fixing a nagging issue with audio on a Vizio sound bar. The problem: during periods of silence in videos, the audio would cut out completely and turn back on with a significant delay.

To illustrate:

This YouTube video has you guess the Disney theme. The theme you guess plays while a 15-second timer counts down. There’s a brief silent pause after each entry.

Except, in the bad default configuration, all audio would cut out until the timer reached 12-10 seconds. Grrr.

After some digging, Reddit came to the rescue: Disable the VZtv Rmt setting. I also took the opportunity to disable the EcoPower setting.

These settings reset to their default (ON) across sound bar firmware upgrades.

About the setup, all devices on latest firmware/OS:

  • LG 55” OLED TV
  • Vizio SB4051-C0 Sound Bar
  • Apple TV 4K
  • Nintendo Switch

There’s nothing exotic about the configuration. The sound bar is connected to the TV’s HDMI ARC port and to the Apple TV via two Belkin UltraHD High Speed 4K HDMI Cables. The Nintendo Switch is connected directly to the TV via the HDMI cable that came with the Switch.

I hope this post helps others who are encountering this frustrating problem.

UPDATE (2022-04-02): Don’t do this kind of setup. I reconnected the Apple TV directly to the TV and the sound bar directly to the TV. This fixed all of the problems.

Book Review: API by Design

API by Design. This book is focused on REST APIs and introduces ways to measure complexity and some techniques to tame it. If you’ve ever worked on a REST API that adversely impacted your WTFs/minute, this is a good book to consider.

What I liked: introducing new ways to communicate complexity, like Reference Entanglement (APIs with excessive references or arbitrary counts of nested resources). I also liked the section that quantified complexity based on optional parameters.

What I didn’t like: I wanted more examples! The book set the table for taming complexity in greenfield designs: I would have enjoyed more examples.

(Note: I know the author, Stephen Mizell, and my review is based on early access content.)

Book Review: Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations

Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. I feel like Accelerate was an OK book. I picked this book up after seeing it recommended on Charity’s blog. I was hopeful for some mind-blowing content, but it mostly confirmed truths I encountered in industry a decade ago.

What I liked: Research-backed conclusions about engineering methodologies. The book provides a reader some vocabulary, techniques, research, and links to business outcomes. If delivering software feels Sisyphean for you or your team, this book can better equip you to try and ameliorate that situation.

What I didn’t like: All of the exhaustive research methodology details. Yes, this matters, especially when making new claims. I did not feel like this fit into the book’s subtitle of “building and scaling high performing technology organizations”.

Overall: There are valid criticisms of the actual research methodologies of the DORA report, which powers a fair amount of the book. Your mileage may vary!

Book Review: Software Engineering at Google: Lessons Learned from Programming Over Time

Software Engineering at Google: Lessons Learned from Programming Over Time. I feel like this book gets less fanfare than the Google SRE books but it’s just as valuable. Like Google’s SRE books, this book has a mixture of useful information, wonderful explanations, and Google-flavored navel gazing.

What I liked: the first 16 chapters or so had me really engaged. I especially enjoyed the authors’ clear definition of software engineering vs. programming/coding. Chapter 7 was very insightful about how to use a data-driven process to measure engineering productivity.

What I didn’t like: the usual kool-aid you get from publications written by very large companies. While the authors admit that Google’s methods aren’t applicable everywhere (and not universally applied within Google), a lot of it felt particularly impractical, especially in contrast to the earlier parts of the book.

On Call Run Books: There’s a Better Way

On call runbooks are a subset of a team’s runbooks used to assist on call responders. I’ve recently had conversations with a few folks about on call runbooks and I thought my view warranted a blog post.

In this post, I’ll describe on call runbooks, their pros and cons, and better places to invest developer time.

My views in this post assume two-pizza software engineering teams that are on call for their own code.


An on call runbook usually follows the Issue->Problem->Resolution->Validation pattern.

Examples of On Call runbook sections:

  • Alarm: HTTP 5xx % exceeded threshold, here are the common problems/resolutions, here’s a way to assert success
  • Alarm: Disk Usage exceeded threshold, here are the common problems/resolutions, here’s a way to assert success
  • Alarm: Queue processing rate is too slow, here are the common problems/resolutions, here’s a way to assert success

The primary goal of an on call runbook is to provide a responder a way to resolve an alarm.

Pros/Cons of On Call Runbooks


Pros

  • Less time to ramp up a new hire to on call.
  • Standard process for responding to “common alarms”.
  • Responders document alarm resolution.
  • Can document workarounds for dependencies outside of the team’s control (e.g., cloud provider, datastores, other teams, etc.)


Cons

  • Good runbooks require upkeep as the system changes. It is hard to maintain a good runbook because it is difficult to detect when a change to a system needs a corresponding runbook update.
  • On call runbook entries that don’t get updates become harmful to responders (especially the new hire), the runbook, and the system itself. At best, a stale entry still resolves the issue. At worst, it does nothing or actively harms the system, lowering responder confidence in the entry, or in runbooks entirely.
  • After its initial creation, any subsequent execution of a runbook entry is toil.
  • Runbooks only cover known-known failure modes. This is not comprehensive. There is no runbook for known-unknown and unknown-unknown failure modes.

🚨On Alarms🚨

Let’s not forget an important idea: Alarms should be rare and fire only under exceptional circumstances. If there are common scenarios when an alarm fires, either the alarming is too sensitive, the service alarms on the wrong signals, or the service is wildly deficient and not meeting expectations. All of these are bad.

An on call runbook concedes that some alarms are not exceptional. That is bad. I’ll echo Charity’s tweet here:

Even with runbooks, exceptional alarms happen. Exceptional alarms have no runbook entry. A responder has to determine the problem and its resolution.

Better Investments: Observability and CI/CD

Since runbooks don’t cover all failure modes, responders already must be able to diagnose/mitigate/resolve issues. This is a sunk cost.

Since responders have to diagnose/mitigate/resolve issues, they need to be able to make assertions about the system’s behavior. They also need to be able to change the system’s state. This need is compounded when the data needed to mitigate/resolve is not already available and it has to be created in situ (via patch/deployment).

Assertions about a system’s behavior map directly to observability: namely tracing, logs, and metrics.

Changing a system’s running state maps directly to CI/CD; quickly testing application changes and deployments. Frequently, exercising an application rollback mitigates a production issue!


Invest in observability and CI/CD over on call runbooks.

On The Big Rewrite

Content disappears from time to time on the web. This post is my own mirror of John Rauser’s Twitter thread that I frequently reference.

Inspired by this HN comment, I offer a story about software rewrites and Bezos as a technical leader.

I was once in an annual planning meeting at Amazon. In attendance was Jeff, every SVP at Amazon, my VP and all the directors in my org.

One of the things we were proposing was a rip-and-replace rewrite of a relatively minor internal system.

As we were discussing other things, Jeff quietly got up and left the room.

This, in and of itself, was not a big deal. We all probably thought he just had to go to the bathroom.

He came back and rejoined the discussion. A few minutes later, an assistant came in and handed Jeff a small stack of papers.

Jeff brought up the proposed rewrite as he handed out the papers. They were copies of this

We all read silently. Go ahead and read it now.

No really, read it. It’s the entire point of the story:

When we were done he said (paraphrased): “If you drop this rewrite we need never talk about this again.”

“If you press on,” he continued, “you need to come back to me with a more complete justification and plan.”

We did not proceed with the rewrite.

OK Zoomer: My Zoom Pro Tips

I’ve been using Zoom on a daily basis for a while now. Even if you’re already a seasoned video conference user, these may help! These pro tips are focused on using Zoom on macOS; other platforms have similar functionality.

Gallery View

The default view in Zoom is called “Speaker View”. This keeps your view focused on whoever is speaking. Gallery View shows all participants at once. I prefer Gallery View for brief, informal team meetings.

In order to use Gallery View, you have to go to “Side by Side Mode” first. Then, you’ll see a button labeled “Speaker View” that toggles between Speaker View and Gallery View in the top right hand corner.

Speaker view, image taken from Zoom’s documentation
Gallery view, image taken from Zoom’s documentation

Keyboard Shortcuts

Judicious use of mute is essential for effective use of video conferencing. Don’t be the person who doesn’t mute!

The only two shortcuts I use on a regular basis (Mac):

  • Command(⌘)+Shift+V: Start/stop video
  • Command(⌘)+Shift+A: Mute/unmute audio

Zoom’s website has a comprehensive list of keyboard shortcuts.


For the vain users, Zoom has a Snapchat-esque filter for touching up your appearance.

Zoom’s claim:

The Touch Up My Appearance option retouches your video display with a soft focus. This can help smooth out the skin tone on your face, to present a more polished looking appearance when you display your video to others.

Touch Up My Appearance  (Zoom Documentation)
Settings -> Video Touch up My Appearance, image taken from Zoom’s documentation

TLS redirect security

A common technique to help TLS migrations is providing a redirect. For example, this blog, hosted on, redirects all HTTP requests on port 80 to one using TLS on port 443.

$ curl -v
* Rebuilt URL to:
*   Trying
* Connected to ( port 80 (#0)
> GET / HTTP/1.1
> Host:
> User-Agent: curl/7.54.0
> Accept: */*
> Referer:
< HTTP/1.1 301 Moved Permanently
< Server: nginx
< Date: Mon, 10 Feb 2020 15:15:43 GMT
< Content-Type: text/html
< Content-Length: 162
< Connection: keep-alive
< Location:

This looks innocuous. The client only received a 301 redirect. However, the entire HTTP request is sent over the wire in plaintext before the redirect is received by the client.

URIs are leaked in plaintext. Even Basic Access Authentication headers are leaked. It’s true!

Wireshark screenshot where basic authorization credentials are visible over the wire (login:password)
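To make the leak concrete, here is a minimal sketch of what an on-path observer can recover from the first plaintext request, before any 301 comes back. The host name, path, and credentials are made-up placeholders, and the two helper functions are hypothetical names for illustration only.

```python
import base64


def build_plaintext_request(host, path, username, password):
    """Return the raw bytes an HTTP/1.1 client would write to port 80."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        f"Authorization: Basic {token}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def sniff_credentials(raw_request):
    """Recover (path, username, password) from the captured bytes."""
    head = raw_request.decode("ascii").split("\r\n")
    path = head[0].split(" ")[1]
    for line in head[1:]:
        name, _, value = line.partition(": ")
        if name.lower() == "authorization" and value.startswith("Basic "):
            # Basic auth is only base64-encoded, not encrypted.
            decoded = base64.b64decode(value[len("Basic "):]).decode("utf-8")
            username, _, password = decoded.partition(":")
            return path, username, password
    return path, None, None


# Placeholder values: nothing here comes from a real capture.
raw = build_plaintext_request("blog.example", "/private?token=abc", "login", "password")
print(sniff_credentials(raw))
```

Nothing in this sketch needs the server's cooperation: the request line and headers are already on the wire by the time the redirect is sent.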


Any security decision has trade-offs. In most cases, like what provides, redirecting a non-TLS connection to TLS is the right decision.

In a brownfield environment, customers may have bookmarks, search engines may have indexed the non-TLS address, and difficult-to-update code could be written to follow URLs that were stored in datastores. Not using TLS redirects can make your customers suffer.

In a greenfield environment, only accept TLS and don’t offer HTTP redirects.

Home Office: January 2020

I spent the majority of the 2010s working from a home office.

Sometimes folks ask about my setup, so I thought it was time to show it all in a post. This post does not contain any affiliate links.

Desk in standing position


Video Conferencing

Microphone: Samson Meteor (2014)

Camera: Logitech HD Portable 1080p Webcam C615 with Autofocus (2014)

Light: Ring Light Webcam Mount (2019). I added a light after changing to a lower light room and reading Scott Hanselman’s Good, Better, Best post.


Why 3 pairs of headphones? I don’t have a great answer! I mainly use the Apple AirPods for phone conversations and walks, the AudioTechnicas for computer audio (listening on VC), and the Sony MX3s for music from my phone.

Computer Essentials

Monitor: Dell UltraSharp 30 Monitor with PremierColor: UP3017 (2017)

Laptop: Apple MacBook Pro 13-inch, 2016, Four Thunderbolt 3 Ports, 16 GB 2133 MHz LPDDR3 3.3 GHz Intel Core i7 (2017). Well-known keyboard issues aside, this laptop has been outstanding. Most days I never touch the laptop’s keyboard.

Keyboard: Microsoft Natural Ergonomic Keyboard 4000 v1.0 (2016).

Mouse: Microsoft Classic Intellimouse (2017). Overall, I’m happy with this mouse. Two nitpicks: 1) interior thumb button support on macOS was lacking 2) there’s an LED underneath that gets unnecessarily bright.

USB Hub: Anker® USB 3.0 7-Port Hub with 36W Power Adapter [12V 3A High-Capacity Power Supply and VIA VL812 Chipset] (2014)

Everything Else


Desk: Jarvis Adjustable Height Desk, 72″ x 30″ (2015). Jarvis was acquired by Fully. My single experience with Fully, requesting replacement grommets, was a positive one.

Standing Mat: Ergodriven Topo Mat (2017).

Chair: Herman Miller Cosm, graphite, tall back with adjustable arms (2019). Purchased at Design Within Reach in Seattle last year. The store had all of the Herman Miller models displayed and the salesperson was excellent to work with. The chair is crazy expensive! I’m considering it a long term investment.


Laptop Stand: Rain Design mStand (2017).

Sleeve: Cable Management Sleeve, JOTO Cord Management System for TV / Computer / Home Entertainment, 19 – 20 inch Flexible Cable Sleeve Wrap Cover Organizer, 4 Piece – Black (2017)

Under desk clips: 25pcs Adhesive Cable Clips Wire Clips Cable Wire Management Wire Cable Holder Clamps Cable Tie Holder for Car Office and Home (2017)

Headphones: Brainwavz Henja Headphones Hanger (2017)

Power Strip: Tripp Lite 6 Outlet Surge Protector Power Strip Clamp Mount (2017)

Notebook: National Brand Laboratory Notebook (2017).

Desk and chair in sitting position.

Office Ergonomics Review

Recently, I had a fair amount of back / neck pain. This surprised me since at $PREVJOB I was measured by a certified ergonomist and given measurements that were suitable for that setup.

I realized that it’d been several years since I was at $PREVJOB, and my desk and its contents have changed considerably. It was time to remeasure!

Some sites that helped me regain some comfort:

Ergonomic Office: Calculate optimal height of Desk, Chair / Standing Desk: This site has loads of graphics, calculators, and details on setting up a traditional desk as well as a standing desk.

Standing Height Calculator: While the Ergonomic Office site had recommendations for standing desk height, they differed from this site’s. This site had a better desk height calculation for me.

CBS News: Standing desk dilemma: Too much time on your feet? This article has lots of standing desk specific information that was useful about how to stand (and how not to stand). (via r/standingdesk)

The most surprising change for me was the 20 degree monitor adjustment that a few of the sites mention. This is a bit more important for folks that wear specific types of glasses like I do.

Finally, don’t forget about ergonomics in the car. During the holidays I had a lot of back pain but I wasn’t at my desk… I was driving a lot more than normal. I found this illustration by Lee Sullivan to be useful:

A Fix for OperationalError: (psycopg2.OperationalError) SSL error: decryption failed or bad record mac

I kept seeing this error message with a uwsgi/Flask/SQLAlchemy application:

[2019-09-01 23:43:34,378] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/flask/", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/site-packages/flask/", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/site-packages/flask/", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/site-packages/flask/", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/site-packages/flask/", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./", line 11, in foo
    session.execute('SELECT 1')
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/", line 1269, in execute
    clause, params or {}
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/", line 1107, in _execute_clauseelement
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/util/", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/", line 1244, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/", line 552, in do_execute
    cursor.execute(statement, parameters)
OperationalError: (psycopg2.OperationalError) SSL error: decryption failed or bad record mac

This issue was difficult to figure out. In fact, I was annoyed enough I made a simple reproduction repo. Searches of StackOverflow did not offer great solutions. The closest I found was the following post:

uWSGI, Flask, sqlalchemy, and postgres: SSL error: decryption failed or bad record mac

The accepted solution suggests using lazy-apps = true in uwsgi.ini. This isn’t a great solution because of the increased memory use. When I enabled this on my application, it slowed down noticeably, but the problem was gone!

The post gave me a clue about the nature of the problem. When uwsgi starts, the python application is imported, then N forks (called workers in uwsgi) of the application are copied (N = processes in uwsgi.ini). This post from Ticketea engineering illustrated this nicely.

Most of the time this is OK! My application’s particular use of psycopg2 was problematic. On the initial application startup, a database connection/engine was created with SQLAlchemy. That same connection was copied to the uwsgi worker processes. The Psycopg documentation calls out uses like this:

The Psycopg module and the connection objects are thread-safe: many threads can access the same database either using separate sessions and creating a connection per thread or using the same connection and creating separate cursors. In DB API 2.0 parlance, Psycopg is level 2 thread safe.

The documentation continues, with the important point:

The above observations are only valid for regular threads: they don’t apply to forked processes nor to green threads. libpq connections shouldn’t be used by a forked processes, so when using a module such as multiprocessing or a forking web deploy method such as FastCGI make sure to create the connections after the fork.

Aha! The connection that was created on application import/startup needed to be torn down. This is an easy enough fix: call engine.dispose() after using the import/startup-time connection, so that each worker opens its own connection after the fork.
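The rule from the Psycopg documentation, "create the connections after the fork", can also be sketched without any database library. This is not the fix used above (that was engine.dispose); it is a generic illustration that remembers which process built a resource and rebuilds it when the process ID changes. PerProcessResource and fake_connect are hypothetical names, and fake_connect is a stand-in for opening a real connection.

```python
import os


class PerProcessResource:
    """Lazily build a resource and rebuild it if the process ID changes."""

    def __init__(self, factory):
        self._factory = factory
        self._owner_pid = None
        self._resource = None

    def get(self):
        pid = os.getpid()
        if self._owner_pid != pid:
            # First use in this process, or we are a forked child that
            # inherited the parent's object: build a fresh resource.
            self._resource = self._factory()
            self._owner_pid = pid
        return self._resource


def fake_connect():
    """Hypothetical stand-in for opening a real database connection."""
    return {"opened_by_pid": os.getpid()}


db = PerProcessResource(fake_connect)
first = db.get()
assert db.get() is first  # same process: the resource is reused
```

A forked worker calling db.get() would see a different PID and open its own resource instead of sharing the parent's socket.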

CSV Imports into Google Cloud BigQuery

If you have a large Postgres database, consider trying Google Cloud BigQuery for OLAP-style reporting queries.

First, export the report data from Postgres as a CSV file with a header row:

database> COPY (<query_to_denormalize_report_data>) TO '<dest_filename>' DELIMITER ',' CSV HEADER;

Then upload it to GCS:

$ gsutil cp <dest_filename> gs://<dest_bucket>/<dest_filename>

Then, follow Loading Data into BigQuery (with Google Cloud Storage). For this to work, you’ll have to define the schema and skip the leading header row.

My most recent load job took 4 seconds, loading a very large CSV file. Query outputs can be saved to Google Sheets or other BigQuery tables.

For those of you who are looking for something similar on AWS, check out Amazon Athena.

python-dateutil gotcha

A quick note: Don’t naively use dateutil.parser.parse() if you know your dates are ISO8601 formatted. Use dateutil.parser.isoparse() instead.

In [1]: from dateutil.parser import *
In [2]: datestr = '2016-11-10 05:11:03.824266'
In [3]: %timeit parse(datestr)
The slowest run took 4.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 211 µs per loop
In [4]: %timeit isoparse(datestr)
100000 loops, best of 3: 17.1 µs per loop
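The same comparison can be reproduced outside IPython with the standard timeit module. The sketch below is an assumption-laden harness, not the measurement above: it times dateutil's parse and isoparse only if python-dateutil is installed (isoparse is missing from older releases), and it always includes the standard library's datetime.fromisoformat (Python 3.7+) as a third, strict-format parser. Timings will vary by machine, so no expected numbers are shown.

```python
import timeit
from datetime import datetime

DATESTR = "2016-11-10 05:11:03.824266"

# Always-available stdlib parser for ISO 8601 strings (Python 3.7+).
parsers = {"datetime.fromisoformat": datetime.fromisoformat}

try:
    # dateutil is optional here; older releases may not provide isoparse.
    from dateutil.parser import isoparse, parse

    parsers["dateutil parse"] = parse
    parsers["dateutil isoparse"] = isoparse
except ImportError:
    pass


def time_parsers(datestr, number=10_000):
    """Return {parser name: average microseconds per call}."""
    timings = {}
    for name, func in parsers.items():
        total = timeit.timeit(lambda: func(datestr), number=number)
        timings[name] = total / number * 1e6
    return timings


for name, micros in time_parsers(DATESTR).items():
    print(f"{name}: {micros:.1f} µs per call")
```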

On Lifelong Learning

One of my goals is lifelong learning.

Since finishing school over a decade ago, most of my learning has taken place on the job or through reading technical books. I often feel that books provide great surveys and details on theory (which are important!) but they’re light on useful information for practitioners.


Coursera was the first experience I had with online classes. I’ve done three courses on Coursera: Computer Networks (available here), Algorithms Part I, and Algorithms Part II.  These are college courses that are available online. The Coursera experience worked well for me by introducing broad, completely new concepts with world class lecturers. For me, there wasn’t much practical value in the assignments/exercises because of the time commitment required.

Coursera is great to introduce a broad new set of topics and most importantly, learn how to ask questions about the subject.

I didn’t attend a top-tier engineering school, but online courses helped me get through interview processes at “big” tech firms.


YouTube and Twitch are amazing resources for learning new skills. Some of the GDB debugging videos are what I’ve been watching lately!

Our local public library partners with the courses there are free. This is a tremendous, low-risk opportunity to take guided tours of new technologies!

See previous learning posts from 2009, 2012.

Book Review: Designing Data-Intensive Applications


Some days I feel that I’ve returned to my old DBA role, well… sort of. A lot of my day to day work revolves around a vast amount of data and applications for data storage and retrieval. I haven’t been a DBA by trade since 2011, nearly an eternity in the tech industry.

To me, Designing Data-Intensive Applications was a wonderful survey of “What’s changed?” as well as “What’s still useful?” regarding all things data. I found Designing had practical, implementable material, especially around newer document databases.

This book is a good balance of theory and practice. I particularly enjoyed the breakdown of concurrency related problems and adding concrete names to classes of issues I’ve seen in practice (e.g., phantom reads, dirty reads, etc).

The section about distributed systems theory was well done but it was a definite lull in the rest of the book’s flow.

The book’s last section and last chapter were both strong. I would recommend this book to all of my colleagues.

Labels on Cloud Dataflow Instances for Cloud BigTable Exports

Google Cloud Platform has the notion of resource labels which are often useful in tracking ownership and cost reporting.

Google Cloud BigTable supports data exports as sequence files. The process uses Cloud Dataflow. Cloud Dataflow ends up spinning up Compute Engine VMs for this processing, up to --maxNumWorkers.

I wanted to see what this was costing to run on a regular basis, but the VMs are ephemeral in nature and unlabelled. There’s an option not mentioned in the Google documentation to accomplish this task!

$ java -jar bigtable-beam-import-1.4.0-shaded.jar export --help=org.apache.beam.runners.dataflow.options.DataflowPipelineOptions
 Options that configure the Dataflow pipeline.

 Labels that will be applied to the billing records for this job.


Looking at PostGIS intersects queries

I’ve been looking hard at datastores recently, in particular PostgreSQL + PostGIS. In school, I did a few GIS related things in class and I assumed it was magic. It turns out there is no magic.

At work, we have an application that accepts polygons and returns objects whose locations are within that polygon. In GIS-speak the polygon is often called an Area of Interest, or AOI, and it is usually represented in one of a few formats: GeoJSON, Well-known Text (WKT), or Well-known Binary (WKB).

So what happens when the request is accepted and how does PostGIS make this all work?

Here’s a graphical representation of my GeoJSON (the ‘raw’ link should show the GeoJSON document):


And the same as WKT:

POLYGON ((-88.35548400878906 36.57362961247927, -88.24562072753906 36.57362961247927, -88.24562072753906 36.630406976495614, -88.35548400878906 36.630406976495614, -88.35548400878906 36.57362961247927))


In the PostGIS world, geospatial data is stored as a geometry type.

Here’s a query returning WKT->Geometry:

db # SELECT ST_GeometryFromText('POLYGON ((-88.35548400878906 36.57362961247927, -88.24562072753906 36.57362961247927, -88.24562072753906 36.630406976495614, -88.35548400878906 36.630406976495614, -88.35548400878906 36.57362961247927))');
-[ RECORD 1 ]-------+------------------------------------------------------------
st_geometryfromtext | 0103000000010000000500000000000040C01656C0CDCEF4B16C49424000000040B80F56C0CDCEF4B16C49424000000040B80F56C0059C012DB150424000000040C01656C0059C012DB150424000000040C01656C0CDCEF4B16C494240


The PostGIS geometry datatype is an lwgeom struct.
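The hex string returned by the query above looks like the polygon in Well-known Binary form, and it can be unpacked by hand to confirm there is no magic. The sketch below assumes plain WKB for a single 2D polygon with no SRID (byte-order flag, geometry type, ring count, then per-ring point counts and pairs of doubles); parse_wkb_polygon is a hypothetical helper, not a PostGIS API.

```python
import struct

# The st_geometryfromtext value from the query output, split for readability.
WKB_HEX = (
    "01" "03000000" "01000000" "05000000"  # little endian, polygon, 1 ring, 5 points
    "00000040C01656C0" "CDCEF4B16C494240"
    "00000040B80F56C0" "CDCEF4B16C494240"
    "00000040B80F56C0" "059C012DB1504240"
    "00000040C01656C0" "059C012DB1504240"
    "00000040C01656C0" "CDCEF4B16C494240"
)


def parse_wkb_polygon(wkb_hex):
    """Decode a plain (no SRID) 2D WKB polygon into a list of rings."""
    data = bytes.fromhex(wkb_hex)
    byte_order = "<" if data[0] == 1 else ">"
    geom_type, ring_count = struct.unpack_from(byte_order + "II", data, 1)
    if geom_type != 3:
        raise ValueError(f"expected polygon (type 3), got type {geom_type}")
    offset = 9
    rings = []
    for _ in range(ring_count):
        (point_count,) = struct.unpack_from(byte_order + "I", data, offset)
        offset += 4
        ring = []
        for _ in range(point_count):
            # Each point is two 8-byte IEEE 754 doubles: x (longitude), y (latitude).
            ring.append(struct.unpack_from(byte_order + "dd", data, offset))
            offset += 16
        rings.append(ring)
    return rings


for x, y in parse_wkb_polygon(WKB_HEX)[0]:
    print(x, y)
```

The five decoded points should line up with the coordinates in the WKT polygon, with the first and last point closing the ring.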

So, to find all of the objects in a given AOI, the SQL query would look something like this (PostGIS FAQ):

SELECT id FROM objects WHERE geom && ST_GeometryFromText('POLYGON ((-88.35548400878906 36.57362961247927, -88.24562072753906 36.57362961247927, -88.24562072753906 36.630406976495614, -88.35548400878906 36.630406976495614, -88.35548400878906 36.57362961247927))')

OK, but what does the && operator do? Thanks to this awesome dba.stackexchange answer, we finally get some answers! In a nutshell:

  • && is short for geometry_overlaps
  • geometry_overlaps passes through several layers of abstraction (the answer did not show them all)
  • in the end, floating point comparisons check whether the bounding box of the object’s geometry in the database overlaps the bounding box of the requested polygon
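As far as I understand the PostGIS documentation, && compares 2D bounding boxes rather than exact shapes, which is why it boils down to a handful of floating point comparisons. A minimal sketch of that comparison follows. The AOI ring is the polygon from this post; the two object locations are made-up test points, and bounding_box and boxes_overlap are hypothetical helper names.

```python
def bounding_box(ring):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) points."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes share any area or edge."""
    a_min_x, a_min_y, a_max_x, a_max_y = a
    b_min_x, b_min_y, b_max_x, b_max_y = b
    return (
        a_min_x <= b_max_x
        and b_min_x <= a_max_x
        and a_min_y <= b_max_y
        and b_min_y <= a_max_y
    )


aoi = [
    (-88.35548400878906, 36.57362961247927),
    (-88.24562072753906, 36.57362961247927),
    (-88.24562072753906, 36.630406976495614),
    (-88.35548400878906, 36.630406976495614),
    (-88.35548400878906, 36.57362961247927),
]

# Hypothetical object locations, each stored as a one-point geometry.
inside = [(-88.30, 36.60)]
outside = [(-87.00, 36.60)]

print(boxes_overlap(bounding_box(inside), bounding_box(aoi)))
print(boxes_overlap(bounding_box(outside), bounding_box(aoi)))
```

For a rectangular AOI like this one the bounding box test is exact. For irregular polygons it can report matches that fall outside the actual shape, which is where an exact check such as ST_Intersects comes in.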


GPG Suite and gpg-agent forwarding

I had to fight a little with GPG Suite (Mac) and forwarding gpg-agent to an Ubuntu 16.04 target. This post describes what ended up working for me.

Source Machine:

  • macOS Sierra 10.12.6
  • GPG Suite 2017.1 (2002)

Target Machine:

  • gpg2 installed on Ubuntu 16.04 (sudo apt-get install gnupg2 -y)

Source machine:

File: ~/.gnupg/gpg-agent.conf

Add this line:

extra-socket /Users/[user]/.gnupg/S.gpg-agent.remote

File: ~/.ssh/config

Add these lines:

RemoteForward /home/[user]/.gnupg/S.gpg-agent /Users/[user]/.gnupg/S.gpg-agent.remote
ExitOnForwardFailure Yes

Restart gpg-agent:

gpgconf --kill gpg-agent
gpgconf --launch gpg-agent

Target machine:

File: /etc/ssh/sshd_config

Add this line:

StreamLocalBindUnlink yes

Restart sshd:

sudo systemctl restart sshd


This article helped a lot while I was troubleshooting.

Book Review: Mastering Ansible

While debugging an Ansible playbook, I tweeted about variable precedence:

Through that tweet I learned how to spell mnemonic AND I got a recommendation to buy Mastering Ansible.  This blog post is a quick review of Mastering Ansible.

Overall, I enjoyed the book’s format more than anything else. The flow of objective/code/screenshot really made the concepts gel (using herp/derp in examples is a close second!). I’m an advanced Ansible user (over 3 years), and this book still showed me new parts of Ansible. The book had very useful suggestions for playbook optimization and simplification. Mastering Ansible also provided some nifty tips and tricks like Jinja Macros, ansible-vault with a script, and merging hashes that I didn’t know were possible.

Of course the book answered my variable precedence question – another thing it did was provide some clarity around using variables with includes which was what I really needed.

If you’re just getting your feet wet with Ansible or you’ve been using it for years, Mastering Ansible is worth picking up.

F5 Rolling Deployments with Ansible

Rolling deployments are well covered in the Ansible Docs. This post is about using those rolling deployment patterns, but with F5 Load Balancers. These techniques use the F5 modules, which require BIG-IP software version >= 11, and have been tested with Ansible 1.9.4.

There are a couple of things to examine before doing one of these deployments:

  • Preflight Checks
  • Ansible’s pre_tasks / post_tasks

Preflight Checks

If you’re using a pair of F5s for High Availability, consider having a preflight check play to ensure your configurations are in sync and determine your primary F5:

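- name: Collect BIG-IP facts
  # module name and include list inferred from the note below this play
  local_action: >
      bigip_facts
      server={{ item }}
      user={{ f5_user }}
      password={{ f5_password }}
      include=device_group,device
  with_items: f5_hosts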

- name: determine primary F5
  set_fact: primary_lb="{{ device[item]['management_address'] }}"
  when: device[item]['failover_state'] == 'HA_STATE_ACTIVE'
  with_items: device.keys()

- name: fail when not in sync
  fail: msg="not proceeding, f5s aren't in sync"
  when: >
    device_group['/Common/device_trust_group']['device']|length > 1 and
    device_group['/Common/device_trust_group']['sync_status']['status'] != 'In Sync'

Recommendation: Use Automatic Sync for doing automated operations against a pair of F5s. I’ve listed the failsafe here for the sake of completeness.

Beware: the bigip_facts module sets facts named “device_group, device” or any other parameter you pass to ‘include’.

pre_tasks & post_tasks

The pre_tasks section looks similar: disable the node, then drain it. There’s a new variable introduced; more on that at the end of this post.

  - name: disable node
    local_action: >
      bigip_node: server={{ primary_lb }}
                  user={{ f5_user }}
                  password={{ f5_pass }}
                  name={{ inventory_hostname }}

  - name: wait for connections to drain
    wait_for: host={{ inventory_hostname }}
              exclude_hosts={{ ','.join(heartbeat_ips) }}
The post_tasks section isn’t much different. No new variables to worry about, just resume connections and re-enable the node.

  - name: wait for connections to resume
    wait_for: host={{ inventory_hostname }}

  - name: enable node
    local_action: >
      bigip_node server={{ primary_lb }}
                 user={{ f5_user }}
                 password={{ f5_pass }}
                 name={{ inventory_hostname }}

A note about heartbeat_ips:

heartbeat_ips are the source IP addresses that the F5 uses to health check your node. They can be found on each F5 by navigating to Network -> Self IPs. If you don’t know how to find them there, disable your node in the F5 and run something like tcpdump -i eth0 port 443; the traffic that keeps arriving is the health checks. The heartbeat IPs can also be discovered with Ansible by adding “self_ip” to the bigip_facts module’s include parameter.


There are two downsides to discovering heartbeat IPs this way:

1) They have to be gathered from each F5 and concatenated together. Because of the way the bigip_facts module operates, this requires use of a register (it gets messy):

exclude_hosts={{ f5_facts.results|join(',', attribute='ansible_facts.self_ip./Common/internal-self.address') }}

2) Gathering facts with the bigip_facts module is pretty time consuming. I’ve seen up to 30s per F5, depending on the amount of data returned. Storing these IPs elsewhere is much faster =)
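
For readers who find the Jinja2 expression in (1) hard to parse, here is a rough Python equivalent. It assumes the registered results look like the attribute path in that expression (ansible_facts, then self_ip, then the self IP name, then address); the sample addresses are made up.

```python
def join_heartbeat_ips(results, self_ip_name="/Common/internal-self"):
    """Collect one self IP address per F5 result and join them with commas."""
    addresses = [
        result["ansible_facts"]["self_ip"][self_ip_name]["address"]
        for result in results
    ]
    return ",".join(addresses)


# Made-up registered results, one entry per F5 in the pair.
results = [
    {"ansible_facts": {"self_ip": {"/Common/internal-self": {"address": "10.0.0.11"}}}},
    {"ansible_facts": {"self_ip": {"/Common/internal-self": {"address": "10.0.0.12"}}}},
]
exclude_hosts = join_heartbeat_ips(results)
```

The resulting string is what would be handed to exclude_hosts in the drain task.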

Giving FlowBAT a try

FlowBAT is a graphical flow based analysis tool.

FlowBAT has a very snappy UI which wraps flow collection software called SiLK. My interests with FlowBAT and SiLK are around using Open vSwitch to push sFlow data to SiLK and FlowBAT to quickly query that data. Hopefully I’ll get around to posting about OVS, sFlow, SiLK and FlowBAT.

While I was getting started with FlowBAT (and SiLK) I saw there weren’t any Ansible roles for either component. The FlowBAT team maintains some shell scripts to install and configure FlowBAT, and the SiLK project has SiLK on a box. I ported both of these projects to Ansible roles and posted them to ansible-galaxy.

To give them both a try in a Vagrant environment, check out vagrant-flowbat.

Remediation as a Service

First aid kit – Marcin Wichary

I’ve seen a couple of automated remediation tools get some coverage lately:

And both have received interesting threads on Hacker News.

One HN commenter that stood out (bigdubs):

I don’t like being the voice of the purist in this, but this seems like a bandaid on a bullet wound.

For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.

The system’s operators have to keep the system up and running. If the bug can’t be diagnosed in a sane amount of time but there’s a clear remediation, the right choice is to get the system up and running.

A lot of operations teams employ this method: If Time to Diagnose > Time to Resolve Then Workaround. This keeps the site up and running, but covering a system in bandaids will lead to more bandaids and their complexity will increase.

The commenter has a point – if everything’s band-aided, the system’s behavior is unpredictable.  Another bad thing about automated remediation is that it can further the divide between developers and operators. With sufficient automated remediation, in many cases operators don’t have to involve developers for fixes. This is bad.

Good reasons to do automated remediation:

  • Not all failures can be attributed to and fixed with software (broken NIC, hard drive, CPU fan)
  • Most companies do not control their technology stack end to end; doing so would also mean shipping your own device drivers or a modified Linux kernel
  • For the technology stack that is under a company’s control, the time to deploy a bugfix is greater than the time to deploy an understood temporary workaround

There are opportunities for improving operations (and operators’ lives) by employing some sort of automated remediation solution, but it is not a panacea. Some tips for finding a sane balance with automated remediation:

  • Associate issues from an issue tracker and quantify the number of times each issue’s workaround is executed – no issue tracker, no automated remediation
  • Give the in-house and vendor software related workarounds a TTL
  • Mark the workarounds as diagnosed/undiagnosed, and invest effort in diagnosing the root cause of the workaround
  • Make gathering diagnostics the first part of the workaround
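
As a rough illustration of these tips, here is a minimal Python sketch of a workaround record that refuses to run without an issue reference, expires after a TTL, gathers diagnostics first, and counts executions. All names here (Workaround, should_workaround, the issue ID format) are hypothetical and not taken from any existing tool.

```python
import time
from dataclasses import dataclass, field


def should_workaround(time_to_diagnose, time_to_resolve):
    """The rule of thumb: work around when diagnosing takes longer than resolving."""
    return time_to_diagnose > time_to_resolve


@dataclass
class Workaround:
    issue_id: str             # issue tracker reference; required
    expires_at: float         # TTL expressed as a Unix timestamp
    diagnosed: bool = False   # has the root cause been found yet?
    executions: int = 0       # how many times the workaround has run
    diagnostics: list = field(default_factory=list)

    def run(self, gather, remediate, now=None):
        now = time.time() if now is None else now
        if not self.issue_id:
            raise ValueError("no issue tracker, no automated remediation")
        if now > self.expires_at:
            raise RuntimeError("workaround for %s has expired" % self.issue_id)
        # Gather diagnostics before touching anything.
        self.diagnostics.append(gather())
        remediate()
        self.executions += 1
```

The execution count per issue gives the number to quantify, and the expiry forces someone to revisit the workaround instead of letting it live forever.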

Finally – consider making investments in technology which enables useful diagnostic data to be gathered quickly and non-disruptively. One example of this is OpenStack Nova’s Guru Meditation Reports.

Remote Work Update

My desk.
My setup at home. Logitech Camera not pictured.


Over a year ago, I left San Antonio and started working from home in Lexington, KY. (previously)

AV Tools

Microphone: Still very happy with the Meteor microphone.

Camera: Purchased a Logitech HD Portable 1080p Webcam C615 with Autofocus. It works with OS X 10.9.4 well enough. The only pain was reconfiguring various VC software to use the new camera. Hangouts users, beware: after every Flash update you must set your non-default video/mic again.

I enjoy the dedicated camera instead of using the one on the MacBook. I work with dual displays, and having to properly position the MacBook or worry about looking into the camera on the MacBook was a bit awkward.

Chat Tools

The majority of my collaboration is on IRC every day. My basic setup uses the Textual client connecting to a ZNC instance running on a server.

I like being very upfront with my status. The usual way I see this done via IRC is ‘nickname’ and ‘nickname_away’ or ‘nickname_zzz’. Don’t do this; it’s the digital equivalent of bad manners. IRC has /away built in, so just use it. Lots of clients are aware of ‘away’ status. ZNC has a nice module called ‘simpleaway’: when you are not connected to ZNC, it sets your status to away.

Turns out you can overshare. After I returned from being idle, help vampires sent messages nearly instantly. ZNC’s ‘antiidle’ module helps counter this by hiding your real idle time.

Finally, our team worked on IRC bots. Some examples:

  • Single bot command to launch a telepresence/VC session with the entire team
  • Music player control
  • Fun sounds playing in the conference room
  • RSS feed streamed to channel for the team’s Redmine/Jira projects


Remote work is awesome. I’m thankful Rackspace gave me the opportunity for remote work. I’m also thankful the team I work with made remote work painless and rewarding.

Troubleshooting LLDP


LLDP is a wonderful protocol which paints a picture of datacenter topology. lldpd is a daemon that runs on your servers, receives LLDP frames, and outputs network location and more.  There’s also a recently patched lldp Ansible module.

Like all tools, LLDP/lldpd has had some issues. Here are the ones I’ve seen in practice, with diagnosis and resolution:

Switch isn’t configured to send LLDP frames


tcpdump -i eth0 -s 1500 -XX -c 1 'ether proto 0x88cc'

Switches with LLDP enabled send LLDP frames every 30s by default, so the capture above should return a frame within about 30 seconds. If it doesn’t, the switchport’s configuration needs to enable LLDP.

Host isn’t reporting LLDP frames

Here the LLDP frames are arriving (per the tcpdump above), but lldpctl returns nothing. Generally, this means lldpd isn’t running on the server.


lldpctl # returns nothing
pgrep -f lldpd # returns nothing
service lldpd restart

Be sure that the lldpd service is set to run at boot and take a look at configuration options.

NIC is dropping LLDP frames

By far the most frustrating: NIC firmware issues can cause the NIC to drop LLDP frames. (Page 10, item 14)

The way this one manifests:

  • lldpctl reports nothing
  • lldpd is running
  • switch is configured to send LLDP frames


Run a packet capture on the switch to ensure that the LLDP frames are being sent to the port. If you’re able to see the frame go out on the wire and traffic is otherwise functioning normally to the host, the problem lies with the NIC.

The fix here was to apply the NIC firmware upgrade- after that, lldp was good to go!
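
The three cases above boil down to a small decision tree. Here is a Python sketch of that triage, assuming you have already collected the observations by hand (tcpdump on the host, pgrep, lldpctl, and optionally a capture on the switch). The function name and return strings are made up for illustration.

```python
def diagnose_lldp(frames_seen_on_host, lldpd_running, lldpctl_has_neighbors,
                  switch_sends_frames=None):
    """Map the observations from this post to a likely cause."""
    if lldpctl_has_neighbors:
        return "ok"
    if frames_seen_on_host:
        if not lldpd_running:
            return "host: start lldpd and set it to run at boot"
        return "host: lldpd is running, check its configuration"
    # No frames seen on the host from here on.
    if switch_sends_frames is None:
        return "unknown: run a packet capture on the switch"
    if not switch_sends_frames:
        return "switch: enable LLDP on the switchport"
    return "nic: frames leave the switch but never arrive, check NIC firmware"
```

The lldpd configuration branch is not one of the cases covered in this post; it is only there so every combination returns something.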

SDN Development Environment

Recently, I began a deeper dive into SDN and OpenFlow. Overall I was very happy with the process and the quality of the material out there for newcomers.

However, I noticed a gap when I hit my first stumbling block. I set up a mininet instance and noticed it was running Open vSwitch (OVS) v2.0. I needed a newer version of OVS, and turfed the mininet instance while upgrading OVS. It quickly became apparent that I needed a repeatable development environment setup.

I created ansible-sdn-dev to help out with this problem.

ansible-sdn-dev includes Ansible roles to build, install and configure these applications:

ansible-sdn-dev also includes a Vagrantfile so you can clone the repository, issue vagrant up and start hacking!

Graphite Events with a Timestamp

There are a few good posts out there about Graphite events and how and why to use them.

Earlier I was trying to add events to Graphite but ran into an issue: my events used a timestamp in the past. The examples I found only showed publishing events with a ‘now’ timestamp.

I went digging and found the change that added events to Graphite. The functionality exists.

Just add a ‘when’ to your payload with a Unix timestamp.

For example:

curl -X POST "http://graphite/events/" \
    -d '{"what": "Event - deploy", "tags": "deploy",
         "when": 1418064254.340719}'

This functionality was in the original Graphite events release.
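
The same request can be sketched in Python with only the standard library. The helper names are hypothetical, and the URL is the placeholder from the curl example; only the ‘what’, ‘tags’ and ‘when’ fields come from the post.

```python
import json
import urllib.request


def build_event(what, tags, when=None):
    """Build the JSON body; 'when' is a Unix timestamp and may be in the past."""
    event = {"what": what, "tags": tags}
    if when is not None:
        event["when"] = when
    return json.dumps(event).encode("utf-8")


def post_event(url, payload):
    """POST the event body to Graphite's events endpoint."""
    request = urllib.request.Request(url, data=payload, method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status


# Usage (not run here, since it needs a reachable Graphite instance):
#   payload = build_event("Event - deploy", "deploy", when=1418064254)
#   post_event("http://graphite/events/", payload)
```

Leaving ‘when’ out falls back to the behaviour in the other examples, where the event is recorded with a ‘now’ timestamp.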

My OpenStack Paris Summit Sessions

Capacity Management/Provisioning (Cloud’s full, Can’t build here) – Video, Slides

As a service provider, Rackspace is constantly bringing new OpenStack capacity online. In this session, we will detail a myriad of challenges around adding new compute capacity. These include: planning, automation, organizational, quality assurance, monitoring, security, networking, integration, and more.

Managing Open vSwitch Across a Large Heterogenous Fleet – Video, Slides

Open vSwitch (OVS) is one of the more popular ways to provide VM connectivity in OpenStack. Rackspace has been using Open vSwitch in production since late 2011. In this session, we will detail the challenges faced with managing and upgrading Open vSwitch across a large heterogenous fleet. Finally, we will share some of the tools we have created to monitor OVS availability and performance.

Specific topics covered will include:

  • Why upgrade OVS?
  • Measuring OVS
  • Minimizing downtime with upgrades
  • Bridge fail modes
  • Kernel module gotchas
  • Monitoring OVS

nsxchecker: Verify the health of your NSX network


Recently I got to work with the NSX API and write a tool to do a quick health check of NSX networks.

nsxchecker is a valuable operational tool to quickly report an NSX network’s health.  One of the promises of SDN is automated tooling for operational teams, and with the NSX API I was quickly able to deliver.

Screen Shot 2014-10-06 at 17.00.10

nsxchecker accepts an NSX lswitch UUID or a neutron_net_id. Rackspace’s Neutron plugin, quark, tags created lports with a neutron_net_id. nsxchecker requires administrative access to the NSX controllers.

Neutron itself supports probes, but they have a couple of drawbacks:

  1. They don’t work with all implementations
  2. For a large network, they’re slow

There are more details in the README on GitHub.

Operating OpenStack: Monitoring RabbitMQ

At the OpenStack Operators meetup, a question was asked about monitoring issues related to RabbitMQ.  Lots of OpenStack components use a message broker, and the most commonly used one among operators is RabbitMQ. For this post I’m going to concentrate on Nova and a couple of scenarios I’ve seen in production.

Screen Shot 2014-08-28 at 11.27.55

It’s important to understand the flow of messages amongst the various components and break things down into a couple of categories:

  • Services which publish messages to queues (arrow pointing toward the queue in the diagram)
  • Services which consume messages from queues (arrow pointing out from the queue in the diagram)

It’s also good to understand what actually happens when a message is consumed. In most cases, the consumer of the queue is writing to a database.

For example, for an instance reboot, nova-api publishes a message to a compute node’s queue. The compute service running on that node polls for messages, receives the reboot request, sends it to the virtualization layer, and updates the instance’s state to rebooting.

Queue-related issues manifest in a few scenarios:

  1. Everything’s broken – easy enough, rebuild or repair the RabbitMQ server. This post does not focus on this scenario because there is a considerable amount of material around hardening RabbitMQ in the OpenStack documentation.
  2. Everything is slow and getting slower – this often points to a queue being published to at a greater rate than it can be consumed. This scenario is more nuanced, and requires an operator to know a couple of things: what queues are shared among many services and what are publish/consume rates during normal operations. 
  3. Some things are slow/not happening – some instance reboot requests go through, some do not. Generally speaking these operations are ‘last mile’ operations that involve a change on the instance itself. This scenario is generally restricted to a single compute node, or possibly a cabinet of compute nodes.

Baselines of RabbitMQ queue size and consumption rate during normal operations are very valuable in scenarios 2 and 3. Without a baseline, it’s difficult to know if the behavior is outside normal operating conditions.

There are a couple of tools that can help you out:

  • Diamond RabbitMQ collector (code, docs)- Send useful metrics from RabbitMQ to graphite, requires the RabbitMQ management plugin
  • RabbitMQ HTTP API – This enables operators to retrieve specific queue statistics instead of a view into an entire RabbitMQ server.
  • Nagios Rabbit Compute Queues – This is a script used with Nagios to check specified compute queues which helps determine if operations to a specific compute may get stuck. This helps what I referred to earlier as scenario 3. Usually a bounce of the nova-compute service helps these.  The script looks for a local config file which would allow access to the RabbitMQ management plugin. Example config file is in the gist.
  • For very real time/granular insight, run the following command on the RabbitMQ server:
    •   watch -n 0.5 'rabbitmqctl -p nova list_queues | sort -rnk2 | head'

Here is an example chart that can be produced with the RabbitMQ diamond collector which can be integrated into an operations dashboard:

Screen Shot 2014-08-28 at 11.19.18

Baseline monitoring of the RabbitMQ servers themselves isn’t enough. I recommend an approach that combines the following:

  • Using the RabbitMQ management plugin (required)
  • Nagios checks on specific queues (optional)
  • Diamond RabbitMQ collector to send data to Graphite
  • Dashboard combining RabbitMQ installations statistics
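
To show what comparing against a baseline can look like, here is a small Python sketch that parses rabbitmqctl list_queues style output (queue name, then message count) and flags queues that are well above their normal depth. The queue names, baseline numbers, and the factor of 2 are made-up assumptions; real baselines would come from your own metrics.

```python
def parse_list_queues(output):
    """Parse 'name<whitespace>count' lines into {queue_name: message_count}."""
    depths = {}
    for line in output.splitlines():
        parts = line.split()
        # Skips banner lines such as 'Listing queues ...' and '...done.'
        if len(parts) == 2 and parts[1].isdigit():
            depths[parts[0]] = int(parts[1])
    return depths


def over_baseline(depths, baseline, factor=2.0, default=10):
    """Return queues whose depth exceeds factor times their baseline depth."""
    flagged = {}
    for name, depth in depths.items():
        limit = baseline.get(name, default) * factor
        if depth > limit:
            flagged[name] = depth
    return flagged


# Made-up sample output and baseline values.
sample = "Listing queues ...\ncompute.host1\t250\ncompute.host2\t0\nscheduler\t3\n...done.\n"
baseline = {"compute.host1": 5, "scheduler": 5}
flagged = over_baseline(parse_list_queues(sample), baseline)
```

A single compute queue standing out like this matches scenario 3; many shared queues growing at once looks more like scenario 2.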

Monitoring Edge Node Network Configuration

Over the last few months I’ve done a bit of work around monitoring, Open vSwitch, and XenServer. This post lists some of the networking/Open vSwitch specific items to monitor on hypervisors.

  • Link Status: Nagios SNMP Interfaces plugin works well for reporting a failed link as well as reporting error rates and inbound/outbound bandwidth.
  • Open vSwitch Manager and Controller Status: Transport Node Status is a quick and dirty python script which can be used with extended SNMP to alert when OVS loses a connection to a manager/controller. Beware of an influx of false alarms after upgrading Open vSwitch.
  • Open vSwitch Kernel Modules:  OVS KMOD (XenServer specific) is another quick bash script which can be used to monitor potential OVS kernel mismatch issues detailed in Upgrading Open vSwitch.
  • SDN Integration processes: Nagios SNMP process check. With XenServer, the ovs-xapi-sync process must be running for proper integration between SDN controllers and ofport/vif objects on the hypervisor.

Are there other network-specific things you monitor for hypervisors running OpenStack? Leave ’em in the comments.

Interested in Open vSwitch? Check the Open vSwitch category for a few more posts.
Interested in Monitoring? Check Managing Nagios Configurations.

On Failure

A couple of interesting research papers around failure, found in The Datacenter as a Computer.

Failure Trends in a Large Disk Drive Population (2007)

Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives.

Temperature Management in Data Centers: Why Some (Might) Like It Hot (2012)

Based on our study of data spanning more than a dozen data centers at three different organizations, and covering a broad range of reliability issues, we find that the effect of high data center temperatures on system reliability are smaller than often assumed.

On Working Remote

Screen Shot 2014-04-26 at 18.58.06

In late March I relocated from San Antonio, TX to Lexington, KY. Same awesome job just with a twist…REMOTE WORK!

I am mainly collaborating via IRC, tmux, 1:1/M:M TeamSpeak, and 1:1/M:M video conferencing.

My takeaways after the first month:

  • Obvious, but it took me a while: When you’re talking, LOOK AT THE CAMERA – multiple monitors and a MacBook make this a little awkward
  • Quality matters: appropriate microphones & video, particularly conference rooms, make a huge difference. Recommend Yeti for conference rooms and Samson Meteor for individuals
  • tmux is great but not great for everything… for long massive jobs redirect the output to a file and just have everyone tail -f the file
  • Google Hangouts Keyboard Shortcuts


Upgrading Open vSwitch

Operating Open vSwitch brings a new set of challenges.

One of those challenges is managing Open vSwitch itself and making sure you’re up to date with performance and stability fixes. For example, in late 2013 there were significant performance improvements with the release of 1.11 (flow wildcarding!) and in the 2.x series there are even more improvements coming.

This means everyone running those old versions of OVS (I’m looking at you, <=1.6) should upgrade and get these huge performance gains.

There are a few things to be aware of when upgrading OVS:

  1. Reloading the kernel module is a data plane impacting event. It’s minimal. Most won’t notice, and the ones that do only see a quick blip. The duration of the interruption is a function of the number of ports and number of flows before the upgrade.
  2. Along those lines, if you orchestrate OVS kernel module reloads with parallel-ssh or Ansible or really any other tool, be mindful of the connection timeouts. All traffic on the host will be momentarily dropped, including your SSH connection! Set your SSH timeouts appropriately or bad things happen!
  3. Pay very close attention to kernel upgrades and OVS kernel module upgrades. Failure to do so could mean your host networking does not survive a reboot!
  4. Changes you’ve made outside of OVS/OVSDB to objects that OVS manages, e.g., manually configured tc buckets, will be destroyed.
  5. If you use XenServer, by upgrading OVS beyond what’s delivered from Citrix directly, you’re likely unsupported.

Here is a rough outline of the OVS upgrade process for an individual hypervisor:

  • Obtain Open vSwitch packages
  • Install Open vSwitch userspace components, kernel module(s) (see #3 and “Where things can really go awry”)
  • Load new Open vSwitch kernel module (/etc/init.d/openvswitch force-kmod-reload)
  • Simplified Ansible Playbook:

The INSTALL file provides more detailed upgrade instructions. In the old days, upgrading Open vSwitch meant you had to either reboot your host or rebuild all of your flows because of the kernel module reload. After the introduction of the kernel module reloads, the upgrade process is more durable and less impacting.

Where things can really go awry

If your OS has a new kernel pending, e.g., after a XenServer service pack, you will want to install the Open vSwitch kernel module packages for both your running kernel and the one which will be running after reboot. Failing to do so can result in losing connectivity to your machine.


It is not a guaranteed loss of networking when the Open vSwitch kernel module doesn’t match the Xen kernel, but it is a best practice to ensure they are in lock-step. The cases I’ve seen happen are usually significant version changes, e.g., 1.6 -> 1.11.

You can check if you’re likely to have a problem by running this code (XenServer only, apologies for quick & dirty bash):

#!/usr/bin/env bash
RUNNING_XEN_KERNEL=`uname -r | sed s/xen//`
PENDING_XEN_KERNEL=`readlink /boot/vmlinuz-2.6-xen  | sed s/xen// | sed s/vmlinuz-//`
OVS_BUILD=`/etc/init.d/openvswitch version | grep ovs-vswitchd | awk '{print $NF}'`
rpm -q openvswitch-modules-xen-$RUNNING_XEN_KERNEL-$OVS_BUILD > /dev/null
if [[ $? == 0 ]]
then
    echo "Current kernel and OVS modules match"
else
    echo "Current kernel and OVS modules do not match"
fi

rpm -q openvswitch-modules-xen-$PENDING_XEN_KERNEL-$OVS_BUILD > /dev/null
if [[ $? == 0 ]]
then
    echo "Pending kernel and OVS modules match"
else
    echo "Pending kernel and OVS will not match after reboot. This can cause system instability."
    exit 1
fi
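
The check itself is just a package-name lookup, which may be easier to see in a few lines of Python. This is a sketch of the logic only: it assumes the package naming pattern used in the script above (openvswitch-modules-xen-KERNEL-BUILD), and the kernel and build strings are made up.

```python
def module_package(kernel, ovs_build):
    """Name of the OVS kernel module package for a given Xen kernel and OVS build."""
    return "openvswitch-modules-xen-%s-%s" % (kernel, ovs_build)


def check_kernels(running, pending, ovs_build, installed_packages):
    """Report whether matching OVS modules exist for the running and pending kernels."""
    installed = set(installed_packages)
    return {
        "running": module_package(running, ovs_build) in installed,
        "pending": module_package(pending, ovs_build) in installed,
    }


# Made-up versions: modules are installed for the running kernel only.
installed = ["openvswitch-modules-xen-2.6.32.12-0.7.1-1.11.0-1"]
result = check_kernels("2.6.32.12-0.7.1", "2.6.32.43-0.4.1", "1.11.0-1", installed)
```

A False for the pending kernel is the case to fix before rebooting.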

Luckily, this can be rolled back. Access the host via DRAC/iLO and roll back the vmlinuz-2.6-xen symlink in /boot to one that matches your installed openvswitch-modules RPM. I made a quick and dirty bash script which can roll back, but it won’t be too useful unless you put the script on the server beforehand. Here it is (again, XenServer only):

#!/usr/bin/env bash
# Not guaranteed to work. YMMV and all that.
OVS_KERNEL_MODULES=`rpm -qa 'openvswitch-modules-xen*' | sed s/openvswitch-modules-xen-// | cut -d "-" -f1,2;`
XEN_KERNELS=`find /boot -name "vmlinuz*xen" \! -type l -exec ls -ld {} + | awk '{print $NF}'  | cut -d "-" -f2,3 | sed s/xen//`
COMMON_KERNEL_VERSION=`echo $XEN_KERNELS $OVS_KERNEL_MODULES | tr " " "\n"  | sort | uniq -d`
stat /boot/vmlinuz-${COMMON_KERNEL_VERSION}xen > /dev/null
if [[ $? == 0 ]]
then
    rm /boot/vmlinuz-2.6-xen
    ln -s /boot/vmlinuz-${COMMON_KERNEL_VERSION}xen /boot/vmlinuz-2.6-xen
else
    echo "Unable to find kernel version to roll back to! :(:(:(:("
fi

StatsD and multiple metrics


Measure all the things! Graphite & statsd are my weapons of choice. One set of metrics in particular that we wanted to measure are the various TCP stats, including TCP Retransmit rate. We crafted a Python script to send all of the metrics in a single UDP packet and hit a weird scenario.

The python script was all ready to roll except that StatsD was only logging one metric.  All of the metric packets were arriving at the StatsD instance, but only one was being processed.

Turns out support for multiple metrics in a single packet wasn’t always built into StatsD. It was added in 0.4.0 and exists in later versions. Upgrading StatsD fixes this problem.
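
For reference, sending several metrics in one UDP packet means separating the individual StatsD lines with newlines. Here is a minimal Python sketch; the metric names and values are made up, and the host and port defaults are assumptions about a local StatsD instance.

```python
import socket


def build_packet(metrics):
    """Join StatsD lines with newlines so they travel in a single UDP datagram."""
    return "\n".join(metrics).encode("ascii")


def send_packet(packet, host="127.0.0.1", port=8125):
    """Send one datagram to StatsD (fire and forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))
    finally:
        sock.close()


# Made-up TCP metrics, all sent as gauges.
metrics = ["tcp.retrans_segs:12|g", "tcp.in_segs:4821|g", "tcp.out_segs:5310|g"]
packet = build_packet(metrics)
```

On a StatsD older than 0.4.0, a packet like this would show the symptom described above: it arrives, but only one metric is processed.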

Deep Dive: OpenStack Retrieving Nova Instance Console URLs with XVP and XenAPI/XenServer

This post is a deep dive into what happens in Nova (and where in the code) when a console URL is retrieved via the nova API for a Nova configuration backed by XVP and XenServer/XenAPI.  Hopefully the methods used in Nova’s code will not change over time, and this guide will remain a good starting point.

Example nova client call:

nova get-vnc-console [uuid] xvpvnc

And the call returns:

| Type   | Url                                                                                                   |
| xvpvnc | https://URL:PORT/console?token=TOKEN |

One thing I particularly enjoy about the console URL call in Nova is that it is synchronous and has to reach all the way down to the VM level. Most calls in Nova are asynchronous, so console is a wonderful test of your cloud’s plumbing. If the call takes longer than rpc_response/rpc_cast_timeout (60/30 sec respectively), a 500 will bubble up to the user.

It helps to understand how XenServer consoles work in general.

  • XVP is an open source project which serves as a proxy to hypervisor console sessions. Interestingly enough, XVP is no longer used in Nova. The underpinnings of Console were changed in vnc-console-cleanup but the code is still around (console/
  • A XenServer VM has a console attribute associated with it. Console is an object in XenAPI.

This Deep Dive has two major sections:

  1. Generation of the console URL
  2. Accessing the console URL

How is the console URL generated?


1) nova-api receives and validates the console request, and then makes a request to the compute API.

  • api/openstack/compute/contrib/
  • def get_vnc_console

2) The compute API receives the request and does two things: (2a) calls the compute RPC API to gather connection information and (2b) calls the console authentication service.

  • compute/
  • def get_vnc_console

2a) The compute RPC receives the call from (2).  An authentication token is generated. For XVP consoles, a URL is generated which has FLAGS.xvpvncproxy_base_url and the generated token. driver.get_vnc_console is called.

  • compute/
  • def get_vnc_console

2a1) driver is an abstraction over the configured virt library, xenapi in this case. This just calls vmops get_vnc_console. XenAPI information about the instance is retrieved. The Xen console URL, which is local to the hypervisor, is generated and returned.

  • virt/xenapi/
  • def get_vnc_console
  • virt/xenapi/
  • def get_vnc_console

2b) Taking the details from 2a1, the consoleauth RPC API is called. The token generated in (2a) is added to memcache with CONF.console_token_ttl.

  • consoleauth/
  • def authorize_console

What happens when the console URL is accessed?


1) The request reaches nova-xvpvncproxy and a call to validate the token is made on the Console Auth RPC API

  • vnc/
  • def __call__

2) The token in the request is checked against the token from the previous section (2b). Compute’s RPC API is called to validate the console’s port against the token’s port.

  • consoleauth/
  • def check_token
  • def _validate_token
  • compute/
  • def validate_console_port

3) nova-xvpvncproxy proxies a connection to the console on the hypervisor.

  • vnc/
  • def proxy_connection
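
To tie the two halves together, here is a toy Python model of the token lifecycle: a token is generated and stored with a TTL when the URL is created, then checked (including the port) when the URL is accessed. This is a simplified sketch, not Nova’s code: a dict stands in for memcache, the class and function names only loosely echo the methods listed above, and the TTL value is arbitrary.

```python
import time
import uuid


class ConsoleAuth:
    """Toy stand-in for the console auth service; a dict replaces memcache."""

    def __init__(self, ttl=600):
        self.ttl = ttl
        self._tokens = {}

    def authorize_console(self, token, instance_uuid, host, port, now=None):
        now = time.time() if now is None else now
        self._tokens[token] = {
            "instance_uuid": instance_uuid,
            "host": host,
            "port": port,
            "expires": now + self.ttl,
        }

    def check_token(self, token, port, now=None):
        """Return connection info if the token is known, fresh, and the port matches."""
        now = time.time() if now is None else now
        info = self._tokens.get(token)
        if info is None or now > info["expires"]:
            return None
        if info["port"] != port:
            return None
        return info


def get_vnc_console(auth, base_url, instance_uuid, host, port, now=None):
    """Generate a token, register it, and build the URL handed back to the user."""
    token = str(uuid.uuid4())
    auth.authorize_console(token, instance_uuid, host, port, now=now)
    return {"type": "xvpvnc", "url": "%s?token=%s" % (base_url, token)}, token
```

In this model an expired token, an unknown token, and a port mismatch all fail the same way, which is roughly what the proxy needs to know before it opens a connection.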

The Host Network Stack

This post is a collection of useful articles/videos that I’ve gathered about networking on XenServer and Linux.



As you can see, there are a multitude of elements to consider when looking into host networking issues for a Linux VM running on XenServer (which is Linux underneath the covers anyway).

Managing Nagios Configurations

There’s a good talk given by  Gabe Westmaas at the HK OpenStack Summit:

The talk describes what Rackspace monitors in the public cloud OpenStack deployment, how responses are handled, and some of the integration points that are used.  I recommend watching it for OpenStack specific monitoring and a little context around this post.

In this post I am going to discuss how the sausage gets made – how the underlying Nagios configuration is managed.

Some background: We have 3 classes of Nagios servers.

  1. Global – monitors global control plane nodes (e.g., glance-api, nova-api, nova-cells, cell nagios)
  2. Cell – monitors cell control plane nodes, and individual clusters of data plane nodes (e.g., compute nodes/hypervisors)
  3. Mixed – smaller environments – these are a combined cell/global

With Puppet, the Nagios node’s class is based on hostname, then the Nagios install/config puppet module is applied.

The Nagios puppet setup is pretty simple. It performs basic installation and configuration of Nagios along with pulling in a git repository of Nagios config files. The puppet modules/manifests change rarely, but the Nagios configuration itself has to change relatively frequently.

Types of changes to the Nagios configuration:

  1. Systems Lifecycle – normal bulk add/remove of service/host definitions. These are generated with some automation, currently a combination of Ansible and Python scripts which reach into other inventory systems.
  2. Gap Filling – as a result of RCAs or other efforts, gaps in the current monitoring configuration are identified. After the gap is identified, we need to ensure it is fully remediated in all existing datacenters and all new spin ups.
  3. Cosmetics/Tweaking – we perform analytics on our monitoring to prioritize/identify opportunities to automate remediation and/or deep dive into root causes. We have a logster parser running on each Nagios node which sends what/when/where on alerts to StatsD/Graphite.  Toward the analytics effort, we sometimes make changes to give all services more machine readable names.  We also tune monitoring thresholds for services that are too chatty or not chatty enough.

Changes #2 and #3  were drivers to put Nagios configuration files into a single repository.  Without a single repository, the en masse changes were cumbersome and didn’t get made. The configuration repository is laid out like this:

  • Shared configurations are stored in a common folder, which has a subfolder for each Nagios node class.
  • Service/Host definitions are stored in folders relative to their environments
  • All datacenters/environments are stored within the environments folder

The entire repository is cloned onto the Nagios node, and parts of it are copied and/or symlinked into /etc/nagios3/conf.d/ based on the Nagios node class and the environment.

For example:

  • nagios class is cell (c0001 in the hostname), environment is test/c0001
  • /etc/nagios3/conf.d/ gets cfg files from the common/cell folder in the config repo
  • environments/test/c0001 is symlinked to  /etc/nagios3/conf.d/c0001/
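
The example above can be expressed as a small mapping from node class and environment to filesystem actions. Here is a Python sketch that only plans the actions rather than performing them; the repository checkout path and the function name are hypothetical.

```python
import posixpath

CONF_D = "/etc/nagios3/conf.d"


def plan_config(repo, node_class, environment):
    """Return (action, source, destination) tuples for one Nagios node."""
    env_name = posixpath.basename(environment)
    return [
        # Shared cfg files for this node class land directly in conf.d.
        ("copy", posixpath.join(repo, "common", node_class), CONF_D),
        # The environment's host/service definitions are symlinked in.
        ("symlink",
         posixpath.join(repo, "environments", environment),
         posixpath.join(CONF_D, env_name)),
    ]


# Hypothetical checkout location; class and environment are from the example.
plan = plan_config("/opt/nagios-config", "cell", "test/c0001")
```

Keeping the plan separate from the copy and symlink calls makes it easy to review what a change to the repository layout would do on each node class.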

This setup has been working well for us in production. It’s enabling first responders and engineers to make more meaningful changes faster to the monitoring stack at Rackspace.

Determining Enabled VLANs from SNMP with Python

Similar to this thread, I wanted to see what VLANs were allowed for a trunked port as reported by SNMP with Python.

With the help of a couple of colleagues, I made some progress.

vlan_value = '000000000020000000000000000000000000200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
# Zero-pad the binary string so that each hex digit contributes exactly 4 bits
bits = format(int(vlan_value, 16), "b").rjust(len(vlan_value) * 4, '0')
for key, value in enumerate(bits):
    if value == '1':
        print(key)

  • Convert the string returned to Hex
  • Convert that to Binary
  • Right fill 0s to the appropriate length to give offset (determined by the size of the string)
  • Loop through the resulting value and each character that is a 1 is an enabled VLAN on the port
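
The same steps can be wrapped in a function, which makes them easier to check against short inputs. This sketch follows the convention above, where a bit’s position in the string is treated as the VLAN ID; whether that offset is right for a given switch and MIB is an assumption to verify.

```python
def enabled_vlans(hex_string):
    """Return the bit positions set to 1 in a hex-encoded VLAN bitmap."""
    width = len(hex_string) * 4                      # 4 bits per hex digit
    bits = format(int(hex_string, 16), "b").zfill(width)
    return [index for index, bit in enumerate(bits) if bit == "1"]
```

For example, enabled_vlans("0020") returns [10]: the hex digit 2 is 0010 in binary, and its set bit is the eleventh bit from the left, counting from zero.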

In conjunction with LLDP, I’m able to query each switch/port an interface is connected to and determine if the VLANs are set properly on the port.

Personal Backups with Duply


A month or two ago I finally went through all the old hard drives I’ve accumulated over the past decade. I mounted each of the disks and moved a bunch of files onto my desktop’s drive. There were lots of photos from the drives that I don’t want to lose so I decided to get a little more serious about backups.

I decided to give Duply a go. Duply is a wrapper for duplicity, which underneath it all uses librsync, an implementation of the tried and trusted rsync algorithm.

  • Multiple Locations – I have duply configured to send various data to a USB drive, Swift (Rackspace Cloud Files), and another server. These are easily configured with the .duply/backup scheme.
  • Encrypted – duply works with GPG encryption
  • Customizable – duply has pre/post hooks which I leverage for notifications on backup success/failures 
  • Efficient – duply is capable of doing incremental backups and using compression

I’ve been really happy with testing restores with duply as well.

An example process that I have set up is as follows:

On my desktop system’s power resume, run an incremental backup to Swift. Notify on start and finish of the backups.

It required a little bit of Python and Bash to accomplish this, but I’m happy with the end result. The scripts I used are published to GitHub under andyhky/duply-scripts. Getting started and installation instructions are in the README.



Network wiring with XenServer and Open vSwitch

In the physical world when you power on a server it’s already cabled (hopefully).

With VMs things are a bit different. Here’s the sequence of events when a VM is started in Nova and what happens on XenServer to wire it up with Open vSwitch.


  1. nova-compute starts the VM via XenAPI
  2. XenAPI VM.start creates a domain and creates the VM’s vifs on the hypervisor
  3. The Linux userspace device manager (udev) receives this event, and scripts within /etc/udev/rules.d are fired in lexical order
  4. Xen’s vif plug script is fired, which at a minimum creates a port on the relevant virtual switch
    • Newer versions (XS 6.1+) of this plug script also have a setup-vif-rules script which creates several entries in the OpenFlow table (just grabbed from the code comments):
      • Allow DHCP traffic (outgoing UDP on port 67)
      • Filter ARP requests
      • Filter ARP responses
      • Allow traffic from specified ipv4 addresses
      • Neighbour solicitation
      • Neighbour advertisement
      • Allow traffic from specified ipv6 addresses
      • Drop all other neighbour discovery
      • Drop other specific ICMPv6 types
      • Router advertisement
      • Redirect gateway
      • Mobile prefix solicitation
      • Mobile prefix advertisement
      • Multicast router advertisement
      • Multicast router solicitation
      • Multicast router termination
      • Drop everything else
  5. Creation of the port on the virtual switch also adds entries into OVSDB, the database which backs Open vSwitch.
  6. ovs-xapi-sync, which starts on XenAPI/Open vSwitch startup, has a local copy of the system’s state in memory. It checks for changes in the Bridge/Interface tables, and pulls XenServer-specific data into other columns in those tables.
  7. On many events within OVSDB, including create/update of tables touched in these OVSDB operations, the OVS controller is notified via JSON-RPC. Thanks to Scott Lowe for clarification on this part.

After all of that happens, the VM boots and the guest OS sets up its network stack.

Measuring Virtual Networking Overhead

While discussing the [ovs-discuss] thread “ovs performance on ‘worst case scenario’ with ovs-vswitchd up to 100%”, one of my colleagues had a good idea: tcpdump the physical interface and the vif at the same time. The difference between when the packet reaches the vif and when it reaches the physical device can help measure the amount of time spent in a userspace->kernelspace transit. Of course, virtual switches aren’t the only culprit in virtual networking overhead; virtual networking is a very complex topic.

I created a new tool to help measure this overhead for certain traffic patterns: netweaver. There’s lots of info in the README, so head on by!

NetWeaver does the following:

  • Retrieve the vif details from the hypervisor
  • Start a traffic generating command on source instance(s)
  • Gather packet capture from destination instance’s hypervisor
  • Analyze the packet captures from the vif and eth devices
  • Perform some basic statistical analysis (average, max, min, stdev) on the result set
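
To make the last two steps concrete, here is a minimal sketch of the arithmetic. It is not NetWeaver’s actual code: the function name transit_stats and the sample timestamps are made up for illustration, and it assumes each packet has already been matched between the two captures.

```python
import statistics


def transit_stats(vif_times, eth_times):
    """Summarize the per-packet time difference between two captures.

    vif_times and eth_times are assumed to be capture timestamps in seconds,
    in the same packet order for both devices.
    """
    deltas = [abs(eth - vif) for vif, eth in zip(vif_times, eth_times)]
    return {
        "average": statistics.mean(deltas),
        "max": max(deltas),
        "min": min(deltas),
        "stdev": statistics.stdev(deltas),  # sample standard deviation
    }


# Made-up timestamps for three packets
print(transit_stats([1.0, 2.0, 3.0], [1.5, 2.25, 3.25]))
```

Taking the absolute difference means the same function works whichever device sees the packet first.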

I intend to use this for analyzing various configurations with Xen, guest OSes, and Open vSwitch.

Deep Dive: HTB Rate Limiting (QoS) with Open vSwitch and XenServer

DISCLAIMER: I’m still getting my feet wet with Open vSwitch. This post is just a cleaned up version of my scratchpad.

Open vSwitch has a few ways of providing rate limiting – this deep dive reverse engineers the egress rate limits applied to an existing virtual interface with tc-htb. Hierarchy Token Bucket (htb) is a standard Linux packet scheduling implementation. More reading on HTB can be done on the author’s site – I found the implementation and theory pretty interesting.

This is current as of Open vSwitch 1.9.

The information needed to retrieve htb rate limits mostly lives in the ovsdb:

Open vSwitch Schema

Things can get complex depending on how your vifs plug into your physical interfaces. In my case, OpenStack Quantum requires an integration bridge which I’ve attempted to diagram:


  1. On instance boot, vifs are plugged into xapi0. xapi0’s controller nodes pull down information including flows and logical queues.
  2. The flows pulled from (1) set the destination queue on all traffic for the source IP address for the interface.
  3. The queue which the traffic gets sent to goes to a linux-htb ring where the packets are scheduled.

Let’s take a look at an example. I want to retrieve the rate limit according to the hypervisor for vif2.1 which connects to xapi0, xenbr1, and the physical interface eth1. The IP address is


  • Find the QoS used by the physical interface:
    # ovs-vsctl find Port name=eth1 | grep qos
    qos : 678567ed-9f71-432b-99a2-2f28efced79c

  • Determine which queue is being used for your virtual interface. The value after set_queue is our queue_id.
    # ovs-ofctl dump-flows xapi0 | grep | grep "set_queue"
    ... ,nw_src= actions=set_queue:13947, ...
  • List the QoS from the first step and its type. NOTE: This command outputs every single OpenFlow queue_id/OVS Queue UUID for the physical interface. The queue_id from the previous step will be the key we’re interested in and the value is our Queue’s UUID
    # ovs-vsctl list Qos 678567ed-9f71-432b-99a2-2f28efced79c | egrep 'queues|type'
    queues : { ... 13947=787b609b-417c-459f-b9df-9fb5b362e815,... }
    type : linux-htb
  • Use the Queue UUID from the previous step to list the Queue:
    # ovs-vsctl list Queue 787b609b-417c-459f-b9df-9fb5b362e815 | grep other_config
    other_config : {... max-rate="614400000" ...}
  • In order to tie it back to tc-htb we have to convert the OpenFlow queue_id+1 to hexadecimal (367c). I think it’s happening here in the OVS code, but I’d love to have a definitive answer.
    # tc -s -d class show dev eth1 | grep 367c | grep ceil # Queue ID + 1 in Hex
    class htb 1:367c ... ceil 614400Kbit
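
The two conversions in the last steps are easy to check by hand. Here is a small sketch; the function names are made up for illustration, and the queue_id + 1 offset is simply what I observed above rather than documented behaviour.

```python
def tc_class_minor(queue_id):
    """Hex class id to look for in the tc output (queue_id + 1, as observed above)."""
    return format(queue_id + 1, "x")


def max_rate_to_kbit(max_rate_bps):
    """Convert the OVS Queue max-rate (bits per second) to tc's Kbit (1000 bits)."""
    return max_rate_bps // 1000


print(tc_class_minor(13947))        # 367c
print(max_rate_to_kbit(614400000))  # 614400
```

Both values line up with the tc output above: class 1:367c with a ceil of 614400Kbit.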

Using Swift and logrotate

Ever have an exchange like this?

Q: What happened <insert very very long time ago> on this service?
A: We can’t keep logs on the server past 2 months.  Those logs are gone.

Just about every IaaS out there has an object store. Amazon offers S3 and OpenStack providers have Swift. Why not just point logrotate at one of those object stores?

That’s just what I’ve done with Swiftrotate. It’s a simple shell script to use with logrotate. Config samples and more are in the project’s README.

NOTE: It doesn’t make a lot of sense to use without using dateext in logrotate. A lot of setups don’t use dateext, so there’s a utility script to rename all of your files to a dateext format.

Home Lab setup


  • Dell XPS 8500
  • Intel i5
  • RAM + SSD upgrades from Crucial
  • Local Storage (1T)


  • Fedora 18 (base OS)
  • VirtualBox
  • Vagrant
  • DevStack
  • XenServer 6

I’m setting up a home lab to do some light coding on OpenStack and for testing implementations of next generation software/hardware deployment tools like BOSH and Razor.

the grep is a lie

grep is a wonderful tool for digging through logs on specific issues, but there are a few cases when people misuse it and claim the logs don’t have the answers when grep didn’t yield an answer.

Here’s an example of Rails application logging from Ruby on Rails Guides:

Processing PostsController#create (for at 2008-09-08 11:52:54) [POST]
  Parameters: {"commit"=>"Create", "post"=>{"title"=>"Debugging Rails",
 "body"=>"I'm learning how to print in logs!!!", "published"=>"0"},
 "authenticity_token"=>"2059c1286e93402e389127b1153204e0d1e275dd", "action"=>"create", "controller"=>"posts"}
New post: {"updated_at"=>nil, "title"=>"Debugging Rails", "body"=>"I'm learning how to print in logs!!!",
 "published"=>false, "created_at"=>nil}
Post should be valid: true
  Post Create (0.000443)   INSERT INTO "posts" ("updated_at", "title", "body", "published",
 "created_at") VALUES('2008-09-08 14:52:54', 'Debugging Rails',
 'I''m learning how to print in logs!!!', 'f', '2008-09-08 14:52:54')
The post was saved and now the user is going to be redirected...
Redirected to #
Completed in 0.01224 (81 reqs/sec) | DB: 0.00044 (3%) | 302 Found [http://localhost/posts]

Grepping for “learning” will give us just a peek but there’s much more information to be found in the full request.

# grep learning applog.log

 "body"=>"I'm learning how to print in logs!!!", "published"=>"0"},
New post: {"updated_at"=>nil, "title"=>"Debugging Rails", "body"=>"I'm learning how to print in logs!!!",
 'I''m learning how to print in logs!!!', 'f', '2008-09-08 14:52:54')

If you know the exact format of your application’s log messages, you can use the output context flags within grep (-A, -B, and -C). However, a lot of the time the exact number of context lines needed is unknown, or a particular stack trace could have a varying length.

Rails applications aren’t the only ones – the logging module within Nova also falls prey to this same issue. Common Log Format seems to get around the problem, but many modern applications, or ones in debug mode, have multiline/transaction-ID logging, which makes sole reliance on grep a bad decision.
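
To illustrate the problem, here is a rough sketch that prints the whole request a match belongs to instead of just the matching line. It assumes every request starts with a line beginning with "Processing ", as in the Rails sample above; the function name matching_records is made up for illustration.

```python
def matching_records(lines, pattern, record_start="Processing "):
    """Yield every multiline record that contains pattern on any of its lines."""
    record = []
    for line in lines:
        # A new record begins: hand back the previous one if it matched
        if line.startswith(record_start) and record:
            if any(pattern in entry for entry in record):
                yield record
            record = []
        record.append(line)
    # Don't forget the final record in the file
    if record and any(pattern in entry for entry in record):
        yield record


# Made-up, shortened log lines
log = [
    "Processing PostsController#create",
    '  "body"=>"I\'m learning how to print in logs!!!"',
    "Completed in 0.01224",
    "Processing PostsController#index",
    "Completed in 0.00310",
]
for record in matching_records(log, "learning"):
    print("\n".join(record))
```

This only works when you know what a record boundary looks like, which is exactly the knowledge plain grep doesn’t have.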

My preferred technique: use grep to determine which log file to open in less. Then search within less for the same pattern I grepped and take a look at the clues provided in the context. Sometimes it’s as simple as seeing a SIGTERM two lines later, but you wouldn’t have grepped for SIGTERM.

Another tip with less and pattern matching: if you have a large file and you know your search string is toward the bottom of it, hit G to move less to the bottom of the file, do your /pattern search, and then press N to find the previous result.

One last thing: if you haven’t given zless/zgrep a try on compressed files, they’re worth their weight in gold.

Deep Dive: OpenStack Nova Snapshot Image Creation with XenAPI/XenServer and Glance

Based on currently available code (nova: a77c0c50166aac04f0707af25946557fbd43ad44 2012-11-02/python-glanceclient: 16aafa728e4b8309b16bcc120b10bc20372883f4 2012-11-07/glance: 9dae32d60fc285d03fdb5586e3368d229485fdb4)

This is a deep dive into what happens (and where in the code) during image creation with a Nova/Glance configuration that is backed by XenServer/XenAPI. Hopefully the methods used in Glance/Nova’s code will not change much over time, and this guide will remain a good starting point.

Disclaimer: I am _not_ a developer, and these are really just best guesses. Corrections are welcome.


1) nova-api receives an imaging request. The request is validated, checking for a name and making sure the request is within quotas. Instance data is retrieved, as well as block device mappings. If the instance is volume backed, a separate compute API call is made to snapshot (self.compute_api.snapshot_volume_backed). For this deep dive, we’ll assume there is no block device mapping. self.compute_api.snapshot is called. The newly created image UUID is returned.

  • nova/api/openstack/compute/
  • def _action_create_image

2) The compute API gets the request and calls _create_image. The instance’s task state is set to IMAGE_SNAPSHOT. Notifications are created for the state change. Several properties are collected about the image, including the minimum RAM, customer, and base image ref. The non-inheritable instance system metadata is also collected. (2a, 2b, 2c) self.image_service.create and (3) self.compute_rpcapi.snapshot_instance are called.

  • nova/compute/
  • def snapshot
  • def _create_image

2a) The collected metadata from 2 is put into a glance-friendly format, and sent to glance. The glance client’s create is called.

  • nova/image/
  • def create

2b) Glance (client) sends a POST to the glance server at /v1/images with the image metadata gathered in (2).

  • glanceclient/v1/
  • def create

2c) Glance (server) receives the POST. Per the code comments:

Upon a successful save of the image data and metadata, a response
containing metadata about the image is returned, including its
opaque identifier.

  • glance/api/v1
  • def create
  • def _handle_source

3) Compute RPC API casts a message to the queue for the instance’s compute node.

  • nova/compute/
  • def snapshot_instance

4) The instance’s power state is read and updated. (4a) The XenAPI driver’s snapshot() is called. Notification is created for the snapshot’s start and end.

  • nova/compute/
  • def snapshot_instance

4a) The vmops snapshot is called (4a1).

  • nova/virt/xenapi/
  • def snapshot

4a1) The snapshot is created in XenServer via (4a1i) vm_utils, and (4a1ii) uploaded to glance. The code’s comments say this:

Steps involved in a XenServer snapshot:

1. XAPI-Snapshot: Snapshotting the instance using XenAPI. This
creates: Snapshot (Template) VM, Snapshot VBD, Snapshot VDI,
Snapshot VHD
2. Wait-for-coalesce: The Snapshot VDI and Instance VDI both point to
a ‘base-copy’ VDI. The base_copy is immutable and may be chained
with other base_copies. If chained, the base_copies
coalesce together, so, we must wait for this coalescing to occur to
get a stable representation of the data on disk.
3. Push-to-glance: Once coalesced, we call a plugin on the XenServer
that will bundle the VHDs together and then push the bundle into

  • nova/virt/xenapi/
  • def snapshot

4a1i) The instance’s root disk is recorded and its VHD parent is also recorded. The SR is recorded. The instance’s root VDI is snapshotted. Operations are blocked until a coalesce completes in _wait_for_vhd_coalesce (4a1i-1).

  • nova/virt/xenapi/
  • def snapshot_attached_here

4a1i-1) The end result of this process is outlined in the code comments:

Before coalesce:

* original_parent_vhd
    * parent_vhd

After coalesce:

* parent_vhd

In (4a1i) the original VDI UUID was recorded. In a nutshell, the code ensures that the desired layout above is met before allowing the snapshot to continue. It polls up to CONF.xenapi_vhd_coalesce_max_attempts times, sleeping CONF.xenapi_vhd_coalesce_poll_interval between attempts. On each attempt the SR is scanned and the original_parent_uuid is compared to the parent_uuid; if they don’t match, we wait a while and check again for the coalescing to complete.

  • nova/virt/xenapi/
  • def _wait_for_vhd_coalesce
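
The polling pattern described in this step looks roughly like the sketch below. It is not Nova’s code: get_parent_uuid is a hypothetical callable standing in for the SR scan and parent lookup, and the default values are placeholders for the two CONF options.

```python
import time


def wait_for_coalesce(get_parent_uuid, original_parent_uuid,
                      max_attempts=5, poll_interval=1.0, sleep=time.sleep):
    """Poll until the parent UUID matches the one recorded earlier.

    Returns the number of attempts it took, or raises if we give up.
    """
    for attempt in range(1, max_attempts + 1):
        # Hypothetical stand-in for "scan the SR, then look up the parent"
        if get_parent_uuid() == original_parent_uuid:
            return attempt
        sleep(poll_interval)
    raise TimeoutError("VHD coalesce did not finish after %d attempts" % max_attempts)


# Simulated lookups: the coalesce completes on the third check
answers = iter(["other-uuid", "other-uuid", "original-uuid"])
print(wait_for_coalesce(lambda: next(answers), "original-uuid", poll_interval=0))
```

Passing sleep in as a parameter is just a convenience so the loop can be exercised without actually waiting.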

4a1ii) The glance API servers are retrieved from configuration. The glance upload_vhd XenAPI plugin is called.

  • nova/virt/xenapi/
  • def upload_image

4a2) A staging area is created, prepared, and _upload_tarball is called.

  • plugins/xenserver/xenapi/etc/xapi.d/plugins/glance
  • def upload_vhd

4a3) The staging area is prepared. This basically symlinks the snapshot VHDs to a temporary folder in the SR.

  • plugins/xenserver/xenapi/etc/xapi.d/plugins/
  • def prepare_staging_area

4a4) The comments say it best:

Create a tarball of the image and then stream that into Glance
using chunked-transfer-encoded HTTP.

A URL is constructed and a connection is opened to it. The image meta properties (like status) are collected and added as HTTP headers. The tarball is created, and streamed to glance in CHUNK_SIZE increments.  The HTTP stream is terminated, the connection checks for an OK from glance and reports accordingly.

  • plugins/xenserver/xenapi/etc/xapi.d/plugins/glance
  • def _upload_tarball
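
For anyone unfamiliar with chunked transfer encoding, here is a minimal sketch of the framing. It is not the plugin’s code: the function name chunked_frames and the CHUNK_SIZE value are assumptions for illustration, and an in-memory stream stands in for the tarball.

```python
import io

CHUNK_SIZE = 8192  # assumed value, for illustration only


def chunked_frames(stream, chunk_size=CHUNK_SIZE):
    """Yield the bytes of stream framed as HTTP chunked transfer encoding."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Each chunk is: size in hex, CRLF, the data, CRLF
        yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    # A zero-length chunk terminates the stream
    yield b"0\r\n\r\n"


# A tiny in-memory "tarball" sent four bytes at a time
for frame in chunked_frames(io.BytesIO(b"abcdefghij"), chunk_size=4):
    print(frame)
```

The sender never needs to know the total size up front, which is handy when the tarball is being created while it is streamed.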

(Glance Server)

5) I’ve removed some of the obvious URL routing functions in glance to get down to the meat of this process. Basically, the PUT request goes to the glance API. The API interacts with the registry again, but this time there is data to be uploaded. The image’s metadata is validated for activation, and then _upload_and_activate is called. _upload_and_activate is basically a way to call _upload and, if it works, activate the image. _upload checks to see if we’re copying, but we’re not. It also checks to see if the HTTP request is application/octet-stream. Then, an object store like swift is inferred from the request or taken from the glance configuration (self.get_store_or_400). Finally, the image is added to the object store, its checksum is verified, and the glance registry is updated. Notifications are also sent for image.upload.

  • glance/api/v1/
  • def update
  • def _handle_source
  • def _upload_and_activate
  • def _upload