On The Big Rewrite

Content disappears from time to time on the web. This post is my own mirror of John Rauser’s Twitter thread that I frequently reference.

Inspired by this HN comment https://news.ycombinator.com/item?id=11554288, I offer a story about software rewrites and Bezos as a technical leader.

I was once in an annual planning meeting at Amazon. In attendance was Jeff, every SVP at Amazon, my VP and all the directors in my org.

One of the things we were proposing was a rip-and-replace rewrite of a relatively minor internal system.

As we were discussing other things, Jeff quietly got up and left the room.

This, in and of itself, was not a big deal. We all probably thought he just had to go to the bathroom.

He came back and rejoined the discussion. A few minutes later, an assistant came in and handed Jeff a small stack of papers.

Jeff brought up the proposed rewrite as he handed out the papers. They were copies of this http://ecolo.org/documents/documents_in_english/Rickover.pdf

We all read silently. Go ahead and read it now. http://ecolo.org/documents/documents_in_english/Rickover.pdf

No really, read it. It’s the entire point of the story: http://ecolo.org/documents/documents_in_english/Rickover.pdf

When we were done he said (paraphrased): “If you drop this rewrite we need never talk about this again.”

“If you press on,” he continued, “you need to come back to me with a more complete justification and plan.”

We did not proceed with the rewrite.


OK Zoomer: My Zoom Pro Tips

I’ve been using Zoom on a daily basis for a while now. Even if you’re already a seasoned video conference user, these may help! These pro tips focus on Zoom for macOS; other platforms have similar functionality.

Gallery View

The default view in Zoom is called “Speaker View”. This keeps your view focused on whoever is speaking. Gallery View shows all participants at once. I prefer Gallery View for brief, informal team meetings.

In order to use Gallery View, you have to go to “Side by Side Mode” first. Then, you’ll see a button labeled “Speaker View” that toggles between Speaker View and Gallery View in the top right hand corner.

Speaker view, image taken from Zoom’s documentation
Gallery view, image taken from Zoom’s documentation

Keyboard Shortcuts

Judicious use of mute is essential for effective use of video conferencing. Don’t be the person who doesn’t mute!

The only two shortcuts I use on a regular basis (Mac):

  • Command(⌘)+Shift+V: Start/stop video
  • Command(⌘)+Shift+A: Mute/unmute audio

Zoom’s website has a comprehensive list of keyboard shortcuts.


For the vain users, Zoom has a Snapchat-esque filter for touching up your appearance.

Zoom’s claim:

The Touch Up My Appearance option retouches your video display with a soft focus. This can help smooth out the skin tone on your face, to present a more polished looking appearance when you display your video to others.

Touch Up My Appearance (Zoom Documentation)
Settings -> Video -> Touch Up My Appearance, image taken from Zoom’s documentation

TLS redirect security

A common technique to help TLS migrations is providing a redirect. For example, this blog, hosted on WordPress.com, redirects all HTTP requests on port 80 to one using TLS on port 443.

$ curl -v http://virtualandy.wordpress.com
* Rebuilt URL to: http://virtualandy.wordpress.com/
*   Trying
* Connected to virtualandy.wordpress.com ( port 80 (#0)
> GET / HTTP/1.1
> Host: virtualandy.wordpress.com
> User-Agent: curl/7.54.0
> Accept: */*
> Referer:
< HTTP/1.1 301 Moved Permanently
< Server: nginx
< Date: Mon, 10 Feb 2020 15:15:43 GMT
< Content-Type: text/html
< Content-Length: 162
< Connection: keep-alive
< Location: https://virtualandy.wordpress.com/

This looks innocuous. The client only received a 301 redirect. However, the entire HTTP request is sent over the wire in plaintext before the redirect is received by the client.

URIs are leaked in plaintext. Even Basic Access Authentication headers are leaked. It’s true!

Wireshark screenshot where basic authorization credentials are visible over the wire (login:password)
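
To see why a leaked Basic Access Authentication header is as good as a leaked password: the header value is only base64-encoded, not encrypted. Here is a minimal sketch using the placeholder login:password credentials from the screenshot; the function names are my own and not part of any library.

```python
import base64
from typing import Tuple


def basic_auth_header(login: str, password: str) -> str:
    """Build the Authorization header value a client sends for Basic auth."""
    raw = f"{login}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header_value: str) -> Tuple[str, str]:
    """Recover (login, password) from a captured Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic auth header")
    login, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return login, password


if __name__ == "__main__":
    captured = basic_auth_header("login", "password")
    print(captured)                     # what travels over the wire on port 80
    print(decode_basic_auth(captured))  # what anyone on the path can read
```

Anyone who can capture the plaintext request can run the second function; no key or secret is involved.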


Any security decision has trade-offs. In most cases, like what WordPress.com provides, redirecting a non-TLS connection to TLS is the right decision.

In a brownfield environment, customers may have bookmarks, search engines may have indexed the non-TLS address, and difficult-to-update code could be written to follow URLs that were stored in datastores. Not using TLS redirects can make your customers suffer.

In a greenfield environment, only accept TLS and don’t offer HTTP redirects.

Home Office: January 2020

I spent the majority of the 2010s working from a home office.

Sometimes folks ask about my setup, so I thought it was time to show it all in a post. This post does not contain any affiliate links.

Desk in standing position


Video Conferencing

Microphone: Samson Meteor (2014)

Camera: Logitech HD Portable 1080p Webcam C615 with Autofocus (2014)

Light: Ring Light Webcam Mount (2019). I added a light after changing to a lower light room and reading Scott Hanselman’s Good, Better, Best post.


Why 3 pairs of headphones? I don’t have a great answer! I mainly use the Apple AirPods for phone conversations and walks, the AudioTechnicas for computer audio (listening on VC), and the Sony MX3s for music from my phone.

Computer Essentials

Monitor: Dell UltraSharp 30 Monitor with PremierColor: UP3017 (2017)

Laptop: Apple MacBook Pro 13-inch, 2016, Four Thunderbolt 3 Ports, 16 GB 2133 MHz LPDDR3 3.3 GHz Intel Core i7 (2017). Well-known keyboard issues aside, this laptop has been outstanding. Most days I never touch the laptop’s keyboard.

Keyboard: Microsoft Natural Ergonomic Keyboard 4000 v1.0 (2016).

Mouse: Microsoft Classic Intellimouse (2017). Overall, I’m happy with this mouse. Two nitpicks: 1) interior thumb button support on macOS was lacking 2) there’s an LED underneath that gets unnecessarily bright.

USB Hub: Anker® USB 3.0 7-Port Hub with 36W Power Adapter [12V 3A High-Capacity Power Supply and VIA VL812 Chipset] (2014)

Everything Else


Desk: Jarvis Adjustable Height Desk, 72″ x 30″ (2015). Jarvis was acquired by Fully. My single experience with Fully, requesting replacement grommets, was a positive one.

Standing Mat: Ergodriven Topo Mat (2017).

Chair: Herman Miller Cosm, graphite, tall back with adjustable arms (2019). Purchased at Design Within Reach in Seattle last year. The store had all of the Herman Miller models displayed and the salesperson was excellent to work with. The chair is crazy expensive! I’m considering it a long term investment.


Laptop Stand: Rain Design mStand (2017).

Sleeve: Cable Management Sleeve, JOTO Cord Management System for TV / Computer / Home Entertainment, 19 – 20 inch Flexible Cable Sleeve Wrap Cover Organizer, 4 Piece – Black (2017)

Under desk clips: 25pcs Adhesive Cable Clips Wire Clips Cable Wire Management Wire Cable Holder Clamps Cable Tie Holder for Car Office and Home (2017)

Headphones: Brainwavz Henja Headphones Hanger (2017)

Power Strip: Tripp Lite 6 Outlet Surge Protector Power Strip Clamp Mount (2017)

Notebook: National Brand Laboratory Notebook (2017).

Desk and chair in sitting position.

Office Ergonomics Review

Recently, I had a fair amount of back / neck pain. This surprised me since at $PREVJOB I was measured by a certified ergonomist and given measurements that were suitable for that setup.

I realized that it’d been several years since I was at $PREVJOB, and my desk and its contents have changed considerably. It was time to remeasure!

Some sites that helped me regain some comfort:

Ergonomic Office: Calculate optimal height of Desk, Chair / Standing Desk: This site has loads of graphics, calculators and details on setting up a traditional desk as well as a standing desk.

Standing Height Calculator: While the Ergonomic Office site had recommendations for standing desk height, they differed from this site’s. This site had a better desk height calculation for me.

CBS News: Standing desk dilemma: Too much time on your feet? This article has lots of useful standing-desk-specific information about how to stand (and how not to stand). (via r/standingdesk)

The most surprising change for me was the 20 degree monitor adjustment that a few of the sites mention. This is a bit more important for folks that wear specific types of glasses like I do.

Finally, don’t forget about ergonomics in the car. During the holidays I had a lot of back pain but I wasn’t at my desk… I was driving a lot more than normal. I found this illustration by Lee Sullivan to be useful:

A Fix for OperationalError: (psycopg2.OperationalError) SSL error: decryption failed or bad record mac

I kept seeing this error message with a uwsgi/Flask/SQLAlchemy application:

[2019-09-01 23:43:34,378] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./http.py", line 11, in foo
    session.execute('SELECT 1')
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1269, in execute
    clause, params or {}
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
OperationalError: (psycopg2.OperationalError) SSL error: decryption failed or bad record mac

This issue was difficult to figure out. In fact, I was annoyed enough that I made a simple reproduction repo. Searches of Stack Overflow did not offer great solutions. The closest I found was the following post:

uWSGI, Flask, sqlalchemy, and postgres: SSL error: decryption failed or bad record mac

The accepted solution suggests using lazy-apps = true in uwsgi.ini. This isn’t a great solution because of the increased memory use. When I enabled this on my application, it slowed down noticeably, but the problem was gone!

The post gave me a clue about the nature of the problem. When uwsgi starts, the Python application is imported, then the process is forked into N copies (called workers in uwsgi; N = processes in uwsgi.ini). This post from Ticketea engineering illustrated this nicely.

Most of the time this is OK! My application’s particular use of psycopg2 was problematic. On the initial application startup, a database connection/engine was created with SQLAlchemy. That same connection was copied to the uwsgi worker processes. The Psycopg documentation calls out uses like this:

The Psycopg module and the connection objects are thread-safe: many threads can access the same database either using separate sessions and creating a connection per thread or using the same connection and creating separate cursors. In DB API 2.0 parlance, Psycopg is level 2 thread safe.

The documentation continues, with the important point:

The above observations are only valid for regular threads: they don’t apply to forked processes nor to green threads. libpq connections shouldn’t be used by a forked processes, so when using a module such as multiprocessing or a forking web deploy method such as FastCGI make sure to create the connections after the fork.

Aha! My connection that was created on application import/startup needed to be torn down! This is an easy enough fix: calling engine.dispose() after I’ve used the import/startup time connection.
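
The fix I used was engine.dispose(). A different, more generic way to follow the Psycopg advice ("create the connections after the fork") is to remember which process opened a connection and reconnect whenever the PID changes. Here is a minimal sketch of that alternative, with sqlite3 from the standard library standing in for the real driver. PerProcessConnection and db are hypothetical names, not part of SQLAlchemy or psycopg2.

```python
import os
import sqlite3


class PerProcessConnection:
    """Hand out a connection that is never shared across a fork.

    sqlite3 is only a stand-in driver here; the point is the PID check.
    """

    def __init__(self, connect):
        self._connect = connect  # zero-argument factory for new connections
        self._conn = None
        self._pid = None

    def get(self):
        pid = os.getpid()
        if self._conn is None or self._pid != pid:
            # First use, or we are a forked worker still holding the
            # parent's connection: do not touch it, open our own instead.
            self._conn = self._connect()
            self._pid = pid
        return self._conn


# Created at import time, but no connection is opened until get() is called.
db = PerProcessConnection(lambda: sqlite3.connect(":memory:"))

if __name__ == "__main__":
    print(db.get().execute("SELECT 1").fetchone())
```

Because the connection is opened lazily inside each worker, nothing that was connected at import time is ever reused after the fork.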

python-dateutil gotcha

A quick note: Don’t naively use dateutil.parser.parse() if you know your dates are ISO 8601 formatted. Use dateutil.parser.isoparse() instead; in the timings below it is roughly 12x faster.

In [1]: from dateutil.parser import *
In [2]: datestr = '2016-11-10 05:11:03.824266'
In [3]: %timeit parse(datestr)
The slowest run took 4.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 211 µs per loop
In [4]: %timeit isoparse(datestr)
100000 loops, best of 3: 17.1 µs per loop

On Lifelong Learning

One of my goals is lifelong learning.

Since finishing school over a decade ago, most of my learning has taken place on the job or through reading technical books. I often feel that books provide great surveys and details on theory (which are important!) but they’re light on useful information for practitioners.


Coursera was the first experience I had with online classes. I’ve done three courses on Coursera: Computer Networks (available here), Algorithms Part I, and Algorithms Part II.  These are college courses that are available online. The Coursera experience worked well for me by introducing broad, completely new concepts with world class lecturers. For me, there wasn’t much practical value in the assignments/exercises because of the time commitment required.

Coursera is great for introducing a broad new set of topics and, most importantly, for learning how to ask questions about the subject.

I didn’t attend a top-tier engineering school, but online courses helped me get through interview processes at “big” tech firms.


YouTube and Twitch are amazing resources for learning new skills. Some of the GDB debugging videos are what I’ve been watching lately!


Our local public library partners with lynda.com, so the courses there are free. This is a tremendous, low-risk opportunity to take guided tours of new technologies!

See previous learning posts from 2009, 2012.

Book Review: Designing Data-Intensive Applications


Some days I feel that I’ve returned to my old DBA role, well… sort of. A lot of my day to day work revolves around a vast amount of data and applications for data storage and retrieval. I haven’t been a DBA by trade since 2011, nearly an eternity in the tech industry.

To me, Designing Data-Intensive Applications was a wonderful survey of “What’s changed?” as well as “What’s still useful?” regarding all things data. I found Designing had practical, implementable material, especially around newer document databases.

This book is a good balance of theory and practice. I particularly enjoyed the breakdown of concurrency-related problems and adding concrete names to classes of issues I’ve seen in practice (e.g., phantom reads and dirty reads).

The section about distributed systems theory was well done but it was a definite lull in the book’s flow.

The book’s last section and last chapter were both strong. I would recommend this book to all of my colleagues.

Labels on Cloud Dataflow Instances for Cloud BigTable Exports

Google Cloud Platform has the notion of resource labels which are often useful in tracking ownership and cost reporting.

Google Cloud BigTable supports data exports as sequence files. The process uses Cloud Dataflow. Cloud Dataflow ends up spinning up Compute Engine VMs for this processing, up to --maxNumWorkers.

I wanted to see what this was costing to run on a regular basis, but the VMs are ephemeral in nature and unlabelled. There’s an option not mentioned in the Google documentation to accomplish this task!

$ java -jar bigtable-beam-import-1.4.0-shaded.jar export --help=org.apache.beam.runners.dataflow.options.DataflowPipelineOptions
 Options that configure the Dataflow pipeline.

 Labels that will be applied to the billing records for this job.


Looking at PostGIS intersects queries

I’ve been looking hard at datastores recently, in particular PostgreSQL + PostGIS. In school, I did a few GIS related things in class and I assumed it was magic. It turns out there is no magic.

At work, we have an application that accepts polygons and returns objects whose locations are within that polygon. In GIS-speak this is often called an Area of Interest or AOI, and it is usually represented in one of a few formats: GeoJSON, Well-known Text (WKT) or Well-known Binary (WKB).

So what happens when the request is accepted and how does PostGIS make this all work?

Here’s a graphical representation of my GeoJSON (the ‘raw’ link should show the GeoJSON document):

And the same as WKT:

POLYGON ((-88.35548400878906 36.57362961247927, -88.24562072753906 36.57362961247927, -88.24562072753906 36.630406976495614, -88.35548400878906 36.630406976495614, -88.35548400878906 36.57362961247927))

In the PostGIS world, geospatial data is stored as a geometry type.

Here’s a query returning WKT->Geometry:

db # SELECT ST_GeometryFromText('POLYGON ((-88.35548400878906 36.57362961247927, -88.24562072753906 36.57362961247927, -88.24562072753906 36.630406976495614, -88.35548400878906 36.630406976495614, -88.35548400878906 36.57362961247927))');
-[ RECORD 1 ]-------+------------------------------------------------------------
st_geometryfromtext | 0103000000010000000500000000000040C01656C0CDCEF4B16C49424000000040B80F56C0CDCEF4B16C49424000000040B80F56C0059C012DB150424000000040C01656C0059C012DB150424000000040C01656C0CDCEF4B16C494240

The PostGIS geometry datatype is an lwgeom struct.
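
The hex string PostGIS printed above is plain Well-known Binary, so it can be unpacked by hand to confirm there is no magic. Here is a minimal sketch that assumes a 2D polygon with no SRID and no Z/M values, which is what this particular output is; parse_wkb_polygon is my own helper, not a PostGIS or library function.

```python
import struct

# Hex output of ST_GeometryFromText for the AOI polygon, split for readability.
WKB_HEX = (
    "01" "03000000" "01000000" "05000000"
    "00000040C01656C0" "CDCEF4B16C494240"
    "00000040B80F56C0" "CDCEF4B16C494240"
    "00000040B80F56C0" "059C012DB1504240"
    "00000040C01656C0" "059C012DB1504240"
    "00000040C01656C0" "CDCEF4B16C494240"
)


def parse_wkb_polygon(hex_string):
    """Decode a plain 2D WKB polygon into a list of rings of (x, y) tuples."""
    data = bytes.fromhex(hex_string)
    endian = "<" if data[0] == 1 else ">"  # 1 = little endian, 0 = big endian
    geom_type, ring_count = struct.unpack_from(endian + "II", data, 1)
    if geom_type != 3:
        raise ValueError("only plain WKB polygons (type 3) are handled")
    offset = 9  # 1 byte order + 4 bytes type + 4 bytes ring count
    rings = []
    for _ in range(ring_count):
        (point_count,) = struct.unpack_from(endian + "I", data, offset)
        offset += 4
        ring = []
        for _ in range(point_count):
            ring.append(struct.unpack_from(endian + "dd", data, offset))
            offset += 16  # two 8-byte doubles per point
        rings.append(ring)
    return rings


if __name__ == "__main__":
    for x, y in parse_wkb_polygon(WKB_HEX)[0]:
        print(x, y)
```

The decoded ring has five points, with the first and last identical, matching the WKT polygon.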

So, to find all of the objects in a given AOI, the SQL query would look something like this (PostGIS FAQ):

SELECT id FROM objects WHERE geom && ST_GeometryFromText('POLYGON ((-88.35548400878906 36.57362961247927, -88.24562072753906 36.57362961247927, -88.24562072753906 36.630406976495614, -88.35548400878906 36.630406976495614, -88.35548400878906 36.57362961247927))')

OK, but what does the && operator do? Thanks to this awesome dba.stackexchange answer, we finally get some answers! In a nutshell:

  • && is short for geometry_overlaps
  • geometry_overlaps passes through several layers of abstraction (the answer did not show them all)
  • in the end, floating point comparisons check whether the bounding box of the stored geometry overlaps the bounding box of the requested polygon
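
The PostGIS documentation describes && as true when the 2D bounding boxes of the two geometries intersect. The comparison at the bottom of those layers can be sketched in a few lines. This is an illustration of the idea, not PostGIS’s actual implementation, and the two object locations are made up.

```python
def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a sequence of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    """True when two (min_x, min_y, max_x, max_y) boxes touch or intersect."""
    a_min_x, a_min_y, a_max_x, a_max_y = a
    b_min_x, b_min_y, b_max_x, b_max_y = b
    return (a_min_x <= b_max_x and b_min_x <= a_max_x and
            a_min_y <= b_max_y and b_min_y <= a_max_y)


# The AOI polygon from the WKT above.
AOI = [
    (-88.35548400878906, 36.57362961247927),
    (-88.24562072753906, 36.57362961247927),
    (-88.24562072753906, 36.630406976495614),
    (-88.35548400878906, 36.630406976495614),
    (-88.35548400878906, 36.57362961247927),
]

if __name__ == "__main__":
    inside = [(-88.30, 36.60)]   # hypothetical object location
    outside = [(-87.00, 36.60)]  # hypothetical object location
    print(boxes_overlap(bounding_box(AOI), bounding_box(inside)))
    print(boxes_overlap(bounding_box(AOI), bounding_box(outside)))
```

Since only bounding boxes are compared, a query that needs exact containment would typically add a precise predicate on top of &&.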


GPG Suite and gpg-agent forwarding

I had to fight a little with GPG Suite (Mac) and forwarding gpg-agent to an Ubuntu 16.04 target. This post describes what ended up working for me.

Source Machine:

  • macOS Sierra 10.12.6
  • GPG Suite 2017.1 (2002)

Target Machine:

  • gpg2 installed on Ubuntu 16.04 (sudo apt-get install gnupg2 -y)

Source machine:

File: ~/.gnupg/gpg-agent.conf

Add this line:

extra-socket /Users/[user]/.gnupg/S.gpg-agent.remote

File: ~/.ssh/config

Add these lines:

RemoteForward /home/[user]/.gnupg/S.gpg-agent /Users/[user]/.gnupg/S.gpg-agent.remote
ExitOnForwardFailure Yes

Restart gpg-agent:

gpgconf --kill gpg-agent
gpgconf --launch gpg-agent

Target machine:

File: /etc/ssh/sshd_config

Add this line:

StreamLocalBindUnlink yes

Restart sshd:

sudo systemctl restart sshd


This article helped a lot while I was troubleshooting.

Book Review: Mastering Ansible

While debugging an Ansible playbook, I tweeted about variable precedence:


Through that tweet I learned how to spell mnemonic AND I got a recommendation to buy Mastering Ansible.  This blog post is a quick review of Mastering Ansible.

Overall, I enjoyed the book’s format more than anything else. The flow of objective/code/screenshot really made the concepts gel (using herp/derp in examples is a close second!). I’m an advanced Ansible user (over 3 years), and this book still showed me new parts of Ansible. The book had very useful suggestions for playbook optimization and simplification. Mastering Ansible also provided some nifty tips and tricks like Jinja Macros, ansible-vault with a script, and merging hashes that I didn’t know were possible.

Of course the book answered my variable precedence question – another thing it did was provide some clarity around using variables with includes which was what I really needed.

If you’re just getting your feet wet with Ansible or you’ve been using it for years, Mastering Ansible is worth picking up.

F5 Rolling Deployments with Ansible

Rolling deployments are well covered in the Ansible Docs. This post is about using those rolling deployment patterns with F5 Load Balancers. These techniques use the F5 modules, which require BIG-IP software version >= 11, and have been tested with Ansible 1.9.4.

There are a couple of things to examine before doing one of these deployments:

  • Preflight Checks
  • Ansible’s pre_tasks / post_tasks

Preflight Checks

If you’re using a pair of F5s for High Availability, consider having a preflight check play to ensure your configurations are in sync and determine your primary F5:

- name: Collect BIG-IP facts
  local_action: >
      bigip_facts
      server={{ item }}
      user={{ f5_user }}
      password={{ f5_password }}
      include=device_group,device
  with_items: f5_hosts

- name: determine primary F5
  set_fact: primary_lb="{{ device[item]['management_address'] }}"
  when: device[item]['failover_state'] == 'HA_STATE_ACTIVE'
  with_items: device.keys()

- name: fail when not in sync
  fail: msg="not proceeding, f5s aren't in sync"
  when: >
    device_group['/Common/device_trust_group']['device']|length > 1 and
    device_group['/Common/device_trust_group']['sync_status']['status'] != 'In Sync'

Recommendation: Use Automatic Sync for doing automated operations against a pair of F5s. I’ve listed the failsafe here for the sake of completeness.

Beware: the bigip_facts module sets facts named after whatever you pass to ‘include’, such as “device_group” and “device”.
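
For anyone who finds the Jinja conditions hard to read, here is the same preflight logic as a small Python sketch. The dictionary shapes mirror the keys used in the plays above; the sample hostnames and addresses are invented for illustration.

```python
def find_primary(device):
    """Return the management address of the active F5, or None if none is active."""
    for facts in device.values():
        if facts["failover_state"] == "HA_STATE_ACTIVE":
            return facts["management_address"]
    return None


def in_sync(device_group, group="/Common/device_trust_group"):
    """A lone device is trivially in sync; a pair must report 'In Sync'."""
    trust = device_group[group]
    return len(trust["device"]) <= 1 or trust["sync_status"]["status"] == "In Sync"


if __name__ == "__main__":
    # Invented sample facts, shaped like the keys the plays reference.
    device = {
        "/Common/f5-a": {"failover_state": "HA_STATE_STANDBY", "management_address": "192.0.2.10"},
        "/Common/f5-b": {"failover_state": "HA_STATE_ACTIVE", "management_address": "192.0.2.11"},
    }
    device_group = {
        "/Common/device_trust_group": {
            "device": ["/Common/f5-a", "/Common/f5-b"],
            "sync_status": {"status": "In Sync"},
        }
    }
    if not in_sync(device_group):
        raise SystemExit("not proceeding, f5s aren't in sync")
    print(find_primary(device))
```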

pre_tasks & post_tasks

The pre_tasks section looks similar: disable the node, then drain connections. There’s a new variable introduced; more on that at the end of this post.

  - name: disable node
    local_action: >
      bigip_node
      server={{ primary_lb }}
      user={{ f5_user }}
      password={{ f5_pass }}
      name={{ inventory_hostname }}

  - name: wait for connections to drain
    wait_for: host={{ inventory_hostname }}
              exclude_hosts={{ ','.join(heartbeat_ips) }}

The post_tasks section isn’t much different. No new variables to worry about, just resume connections and re-enable the node.

  - name: wait for connections to resume
    wait_for: host={{ inventory_hostname }}

  - name: enable node
    local_action: >
      bigip_node
      server={{ primary_lb }}
      user={{ f5_user }}
      password={{ f5_pass }}
      name={{ inventory_hostname }}

A note about heartbeat_ips:

These are the source IP addresses that the F5 uses to health check your node. They can be found on each F5 by navigating to Network -> Self IPs. If you can’t find them there, disable your node in the F5 and run something like tcpdump -i eth0 port 443; with the node disabled, the connections that remain come from the health checks. The heartbeat IPs can also be discovered by adding “self_ip” to the bigip_facts module’s include parameter.


There are two downsides to discovering heartbeat IPs this way:

1) They have to be gathered from each F5 and concatenated together. The way the bigip_facts module operates this requires use of a register. (it gets messy)

exclude_hosts={{ f5_facts.results|join(',', attribute='ansible_facts.self_ip./Common/internal-self.address') }}

2) The bigip_facts gathering module is pretty time consuming. I’ve seen up to 30s per F5 depending on the amount of data returned. Storing these IPs elsewhere is much faster =)

Giving FlowBAT a try

FlowBAT is a graphical flow based analysis tool.

FlowBAT has a very snappy UI which wraps flow collection software called SiLK. My interests with FlowBAT and SiLK are around using Open vSwitch to push sFlow data to SiLK and using FlowBAT to quickly query that data. Hopefully I’ll get around to posting about OVS, sFlow, SiLK and FlowBAT.

While I was getting started with FlowBAT (and SiLK) I saw there weren’t any Ansible roles for either component. The FlowBAT team maintains some shell scripts to install and configure FlowBAT and the SiLK project has SiLK on a box. I ported both of these projects to Ansible roles and posted them to Ansible Galaxy.

To give them both a try in a Vagrant environment, check out vagrant-flowbat.

Remediation as a Service

First aid kit – Marcin Wichary – https://www.flickr.com/photos/mwichary/2615558474/

I’ve seen a couple of automated remediation tools get some coverage lately:

And both have received interesting threads on Hacker News.

One HN commenter that stood out (bigdubs):

I don’t like being the voice of the purist in this, but this seems like a bandaid on a bullet wound.

For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.

The system’s operators have to keep the system up and running. If the bug can’t be diagnosed in a sane amount of time but there’s a clear remediation, the right choice is to get the system up and running.

A lot of operations teams employ this method: If Time to Diagnose > Time to Resolve Then Workaround. This keeps the site up and running, but covering a system in bandaids will lead to more bandaids and their complexity will increase.

The commenter has a point – if everything’s band-aided, the system’s behavior is unpredictable. Another bad thing about automated remediation is that it can further the divide between developers and operators. With sufficient automated remediation, in many cases operators don’t have to involve developers for fixes. This is bad.

Good reasons to do automated remediation:

  • Not all failures can be attributed to and fixed with software (broken NIC, hard drive, CPU fan)
  • Most companies do not control their technology stack end to end; fixing every root cause could mean shipping your own device driver or modified Linux kernel
  • For the technology stack that is under a company’s control, the time to deploy a bugfix is greater than the time to deploy an understood temporary workaround

There are opportunities for improving operations (and operators’ lives) by employing some sort of automated remediation solution, but it is not a panacea. Some tips for finding a sane balance with automated remediation:

  • Associate issues from an issue tracker and quantify the number of times each issue’s workaround is executed – no issue tracker, no automated remediation
  • Give the in-house and vendor software related workarounds a TTL
  • Mark the workarounds as diagnosed/undiagnosed, and invest effort in diagnosing the root cause of the workaround
  • Make gathering diagnostics the first part of the workaround
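
The tips above can be sketched as a small wrapper around each workaround. This is one possible shape under my own assumptions, not any existing tool’s API; the class, field names and the OPS-123 issue ID are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Workaround:
    issue_id: str                          # issue tracker reference, required
    gather_diagnostics: Callable[[], str]  # runs before the fix
    remediate: Callable[[], None]          # the actual workaround
    expires_at: float                      # TTL as a Unix timestamp
    diagnosed: bool = False                # has the root cause been found?
    executions: int = 0                    # how often the workaround has fired
    diagnostics: List[str] = field(default_factory=list)

    def run(self, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        if not self.issue_id:
            raise ValueError("no issue tracker reference, no automated remediation")
        if now >= self.expires_at:
            raise RuntimeError(f"workaround for {self.issue_id} has expired")
        # Diagnostics first, so the evidence is captured before it is wiped.
        self.diagnostics.append(self.gather_diagnostics())
        self.remediate()
        self.executions += 1


if __name__ == "__main__":
    restart = Workaround(
        issue_id="OPS-123",
        gather_diagnostics=lambda: "thread dump placeholder",
        remediate=lambda: print("restarting service"),
        expires_at=time.time() + 30 * 24 * 3600,  # 30 day TTL
    )
    restart.run()
    print(restart.issue_id, restart.executions, restart.diagnosed)
```

The execution count and the diagnosed flag are what make it possible to report which undiagnosed workarounds fire most often, which is where the diagnosis effort should go first.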

Finally – consider making investments in technology which enables useful diagnostic data to be gathered quickly and non-disruptively. One example of this is OpenStack Nova’s Guru Meditation reports.

Troubleshooting LLDP


LLDP is a wonderful protocol which paints a picture of datacenter topology. lldpd is a daemon that runs on your servers, receives LLDP frames, and outputs network location and more. There’s also a recently patched lldp Ansible module.

Like all tools, using LLDP/lldpd has had some issues. Here are the ones I’ve seen in practice, with diagnosis and resolution:

Switch isn’t configured to send LLDP frames


tcpdump -i eth0 -s 1500 -XX -c 1 'ether proto 0x88cc'

Switches will send the LLDP frames by default every 30s. The switchport’s configuration needs to enable LLDP.

Host isn’t reporting LLDP frames

Generally, this means lldpd isn’t running on the server: the LLDP frames are arriving (per the above tcpdump), but lldpctl returns nothing.


lldpctl # returns nothing
pgrep -f lldpd # returns nothing
service lldpd restart

Be sure that the lldpd service is set to run at boot and take a look at configuration options.

NIC is dropping LLDP frames

By far the most frustrating: NIC firmware issues can cause the NIC to drop LLDP frames. (Page 10, item 14)

The way this one manifests:

  • lldpctl reports nothing
  • lldpd is running
  • switch is configured to send LLDP frames


Run a packet capture on the switch to ensure that the LLDP frames are being sent to the port. If you’re able to see the frame go out on the wire and traffic is otherwise functioning normally to the host, the problem lies with the NIC.

The fix here was to apply the NIC firmware upgrade. After that, LLDP was good to go!
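
The three cases above boil down to a short decision tree. Here is a sketch of it as a function; the parameter names and return strings are mine, and the inputs are whatever you observed from lldpctl, tcpdump on the host, pgrep, and a capture on the switch.

```python
def diagnose_lldp(lldpctl_reports, frames_on_host, lldpd_running, switch_sends):
    """Map observed symptoms to the likely cause described in this post."""
    if lldpctl_reports:
        return "ok"
    if frames_on_host:
        # Frames reach the host, so the problem is on the server side.
        if lldpd_running:
            return "check lldpd configuration"
        return "start lldpd and enable it at boot"
    if not switch_sends:
        return "enable LLDP on the switchport"
    # The switch sends the frame but the host never sees it.
    return "suspect NIC firmware"


if __name__ == "__main__":
    print(diagnose_lldp(lldpctl_reports=False, frames_on_host=False,
                        lldpd_running=True, switch_sends=True))
```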