Book Review: Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations

I feel like Accelerate was an OK book. I picked it up after seeing it recommended on Charity’s blog. I was hoping for some mind-blowing content, but it mostly confirmed truths I encountered in industry a decade ago.

What I liked: Research-backed conclusions about engineering methodologies. The book gives the reader vocabulary, techniques, research, and links to business outcomes. If delivering software feels Sisyphean for you or your team, this book can better equip you to improve that situation.

What I didn’t like: The exhaustive detail on research methodology. Yes, methodology matters, especially when making new claims, but I did not feel it fit the book’s subtitle of “building and scaling high performing technology organizations”.

Overall: There are valid criticisms of the actual research methodologies of the DORA report, which powers a fair amount of the book. Your mileage may vary!

Remediation as a Service

First aid kit – Marcin Wichary – https://www.flickr.com/photos/mwichary/2615558474/

I’ve seen a couple of automated remediation tools get some coverage lately, and both have received interesting threads on Hacker News.

One HN comment that stood out (from bigdubs):

I don’t like being the voice of the purist in this, but this seems like a bandaid on a bullet wound.

For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.

The system’s operators have to keep the system up and running. If the bug can’t be diagnosed in a sane amount of time but there’s a clear remediation, the right choice is to get the system up and running.

A lot of operations teams follow this rule: if time to diagnose > time to resolve, then apply the workaround. This keeps the site up and running, but covering a system in bandaids leads to more bandaids, and their combined complexity keeps growing.
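Written down as code, the rule is about as small as it sounds. This is only a sketch: the function name and the idea of estimating both times in minutes are my own assumptions, not something a particular tool provides.

```python
def choose_action(time_to_diagnose_min: float, time_to_resolve_min: float) -> str:
    """Pick "workaround" when diagnosing would take longer than the known fix.

    Both arguments are rough operator estimates in minutes (an assumption
    made for this sketch).
    """
    if time_to_diagnose_min > time_to_resolve_min:
        return "workaround"
    return "diagnose"
```

The hard part in practice is the estimate, not the comparison, which is one reason the workarounds tend to pile up.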

The commenter has a point – if everything’s band-aided, the system’s behavior is unpredictable. Another downside of automated remediation is that it can widen the divide between developers and operators. With sufficient automated remediation, operators often don’t have to involve developers for fixes, so developers never learn about the underlying problems. This is bad.

Good reasons to do automated remediation:

  • Not all failures can be attributed to and fixed with software (broken NIC, hard drive, CPU fan)
  • Most companies do not control their technology stack end to end; doing so would mean shipping your own device drivers or a modified Linux kernel
  • For the technology stack that is under a company’s control, the time to deploy a bugfix is greater than the time to deploy an understood temporary workaround

There are opportunities for improving operations (and operators’ lives) by employing some sort of automated remediation solution, but it is not a panacea. Some tips for finding a sane balance with automated remediation:

  • Associate each workaround with an issue in an issue tracker and count the number of times each issue’s workaround is executed – no issue tracker, no automated remediation
  • Give the in-house and vendor software related workarounds a TTL
  • Mark the workarounds as diagnosed/undiagnosed, and invest effort in diagnosing the root cause of the workaround
  • Make gathering diagnostics the first part of the workaround
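Here is one way the tips above could fit together in code. It is a minimal sketch under my own assumptions: the `Workaround` class, its field names, and the "OPS-123" style issue key are all hypothetical and not taken from any real remediation tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class Workaround:
    issue_id: str                            # hypothetical tracker key, e.g. "OPS-123"
    collect_diagnostics: Callable[[], str]   # runs before the fix is applied
    remediate: Callable[[], None]            # the actual bandaid
    expires_at: datetime                     # TTL: refuse to run after this time
    diagnosed: bool = False                  # has the root cause been found yet?
    run_count: int = 0                       # how often this bandaid was applied
    diagnostics: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # No issue tracker entry, no automated remediation.
        if not self.issue_id:
            raise ValueError("workaround must reference a tracked issue")

    def run(self, now: datetime) -> bool:
        """Apply the workaround; return False if its TTL has expired."""
        if now >= self.expires_at:
            return False
        self.diagnostics.append(self.collect_diagnostics())  # diagnostics first
        self.remediate()
        self.run_count += 1
        return True
```

A rising `run_count` on a workaround that is still marked `diagnosed=False` is a good signal for where to spend root-cause time next.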

Finally – consider investing in technology that lets useful diagnostic data be gathered quickly and non-disruptively. One example of this is OpenStack Nova’s Guru Meditation reports.

StatsD and multiple metrics


Measure all the things! Graphite & StatsD are my weapons of choice. One set of metrics in particular that we wanted to measure was the various TCP stats, including the TCP retransmit rate. We crafted a Python script to send all of the metrics in a single UDP packet and hit a weird scenario.

The Python script was all ready to roll, except that StatsD was only logging one metric. The packets carrying all of the metrics were arriving at the StatsD instance, but only one metric from each was being processed.

Turns out support for multiple metrics in a single packet wasn’t always built into StatsD. It was added in 0.4.0 and exists in later versions. Upgrading StatsD fixes this problem.
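For reference, a multi-metric packet is just several StatsD lines joined with newlines. The sketch below is not our original script; the metric names, the gauge type, and the helper names are assumptions for illustration.

```python
import socket
from typing import Dict


def build_packet(gauges: Dict[str, int]) -> bytes:
    """Join several StatsD gauge lines into one newline-separated payload."""
    lines = [f"{name}:{value}|g" for name, value in gauges.items()]
    return "\n".join(lines).encode("ascii")


def send_packet(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send the payload as a single UDP datagram (8125 is StatsD's default port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric names; call send_packet(payload) to actually ship them.
payload = build_packet({"tcp.retrans_segs": 12, "tcp.out_segs": 3400})
```

On a StatsD older than 0.4.0, a payload like this is where you would see only one of the metrics show up.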

Using Swift and logrotate

Ever have an exchange like this?

Q: What happened <insert very very long time ago> on this service?
A: We can’t keep logs on the server past 2 months.  Those logs are gone.

Just about every IaaS out there has an object store. Amazon offers S3 and OpenStack providers have Swift. Why not just point logrotate at one of those object stores?

That’s just what I’ve done with Swiftrotate. It’s a simple shell script to use with logrotate. Config samples and more are in the project’s README.

NOTE: Swiftrotate doesn’t make a lot of sense without logrotate’s dateext option, since date-stamped names are what keep the uploaded objects unique. A lot of setups don’t use dateext, so there’s a utility script to rename all of your existing files to the dateext format.
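To give a feel for what that rename involves, here is a small Python sketch. It is not the utility script from the project; the function name is made up, and it assumes logrotate's default dateext suffix of `-%Y%m%d` and that the caller supplies the rotation date (for example, from the file's mtime).

```python
import re
from datetime import date

# Matches numbered rotations such as "app.log.1" or "app.log.3.gz".
_NUMBERED = re.compile(r"^(?P<base>.+)\.(?P<n>\d+)(?P<gz>\.gz)?$")


def dateext_name(filename: str, rotated_on: date) -> str:
    """Map a numbered rotated log name to a dateext-style name."""
    match = _NUMBERED.match(filename)
    if match is None:
        return filename  # not a numbered rotation; leave it untouched
    suffix = match.group("gz") or ""
    return f"{match.group('base')}-{rotated_on:%Y%m%d}{suffix}"
```

Two numbered files rotated on the same day would map to the same name, so a real script needs to decide how to handle that collision.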