Remediation as a Service

I’ve seen a couple of automated remediation tools get some coverage lately:

And both have received interesting threads on Hacker News.

One HN comment that stood out (from bigdubs):

I don’t like being the voice of the purist in this, but this seems like a bandaid on a bullet wound.

For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.

The system’s operators have to keep the system up and running. If the bug can’t be diagnosed in a sane amount of time but there’s a clear remediation, the right choice is to get the system up and running.

A lot of operations teams employ this method: if Time to Diagnose > Time to Resolve, then work around. This keeps the site up and running, but covering a system in band-aids leads to more band-aids, and their combined complexity grows.

The commenter has a point: if everything's band-aided, the system's behavior is unpredictable. Another downside of automated remediation is that it can widen the divide between developers and operators. With sufficient automated remediation, operators in many cases don't have to involve developers for fixes, so the underlying bugs never reach the people who could fix them. This is bad.

Good reasons to do automated remediation:

  • Not all failures can be attributed to and fixed with software (broken NIC, hard drive, CPU fan)
  • Most companies do not control their technology stack end to end; doing so would also mean shipping your own device drivers or a modified Linux kernel
  • For the technology stack that is under a company’s control, the time to deploy a bugfix is greater than the time to deploy an understood temporary workaround

There are opportunities for improving operations (and operators' lives) by employing some sort of automated remediation solution, but it is not a panacea. Some tips for finding a sane balance with automated remediation:

  • Associate each workaround with an issue in an issue tracker and count the number of times each issue's workaround is executed. No issue tracker, no automated remediation
  • Give workarounds for in-house and vendor software a TTL
  • Mark each workaround as diagnosed/undiagnosed, and invest effort in diagnosing the root cause behind the undiagnosed ones
  • Make gathering diagnostics the first part of the workaround
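
Here's a rough sketch of how those four tips could fit together in code. It isn't taken from any of the tools mentioned above; the `Workaround` class and all of its field names are made up for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Workaround:
    """Hypothetical record tying one automated workaround to one tracked issue."""
    issue_id: str                           # issue tracker reference (required)
    gather_diagnostics: Callable[[], str]   # runs before the fix is applied
    remediate: Callable[[], None]           # the actual workaround
    expires_at: float                       # TTL, as absolute epoch seconds
    diagnosed: bool = False                 # is the root cause understood yet?
    run_count: int = 0                      # how often the workaround has fired
    diagnostics: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # No issue tracker entry, no automated remediation.
        if not self.issue_id:
            raise ValueError("a workaround must reference a tracked issue")

    def run(self, now: Optional[float] = None) -> bool:
        """Apply the workaround unless its TTL has expired."""
        now = time.time() if now is None else now
        if now >= self.expires_at:
            return False  # expired: someone has to look at the issue again
        # Diagnostics first, so the evidence is captured before it is destroyed.
        self.diagnostics.append(self.gather_diagnostics())
        self.remediate()
        self.run_count += 1
        return True
```

When `run()` starts returning `False`, the workaround has outlived its TTL, and `run_count` tells you how much pain the open issue has actually caused.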

Finally, consider making investments in technology that enables useful diagnostic data to be gathered quickly and non-disruptively. One example of this is OpenStack Nova's Guru Meditation reports.

StatsD and multiple metrics


Measure all the things! Graphite & StatsD are my weapons of choice. One set of metrics in particular that we wanted to measure is the various TCP stats, including the TCP retransmit rate. We crafted a Python script to send all of the metrics in a single UDP packet and hit a weird scenario.

The Python script was all ready to roll, except that StatsD was only logging one metric. All of the metric packets were arriving at the StatsD instance, but only one metric from each was being processed.

Turns out support for multiple metrics in a single packet wasn't always built into StatsD. It was added in 0.4.0 and exists in later versions. Upgrading StatsD fixes this problem.
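
For reference, here is a minimal sketch of what sending several metrics in one packet can look like. It is not the original script: the metric names are made up, and it assumes the newline-separated multi-metric format that StatsD documents, with gauges (`|g`) and StatsD's default UDP port of 8125.

```python
import socket
from typing import Dict


def build_packet(gauges: Dict[str, float]) -> bytes:
    """Join gauge metrics into a single newline-separated StatsD payload."""
    lines = [f"{name}:{value}|g" for name, value in gauges.items()]
    return "\n".join(lines).encode("ascii")


def send_metrics(gauges: Dict[str, float],
                 host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Send all gauges in one UDP datagram and return the payload that was sent."""
    payload = build_packet(gauges)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    # Hypothetical metric names; substitute whatever your TCP stats collector emits.
    send_metrics({"tcp.retrans_segs": 3, "tcp.out_segs": 100})
```

Keep the payload smaller than your network's MTU, since a UDP datagram that gets fragmented is more likely to be dropped.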

Using Swift and logrotate

Ever have an exchange like this?

Q: What happened <insert very very long time ago> on this service?
A: We can’t keep logs on the server past 2 months.  Those logs are gone.

Just about every IaaS out there has an object store. Amazon offers S3 and OpenStack providers have Swift. Why not just point logrotate at one of those object stores?

That’s just what I’ve done with Swiftrotate. It’s a simple shell script to use with logrotate. Config samples and more are in the project’s README.

NOTE: It doesn't make a lot of sense to use Swiftrotate without dateext in logrotate. A lot of setups don't use dateext, so there's a utility script to rename all of your existing rotated files to a dateext format.
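
To give an idea of what that kind of rename involves, here is a small sketch in Python. It is not the utility script from the project. It assumes numbered rotations such as `app.log.2.gz`, logrotate's default dateext suffix of `-YYYYMMDD`, and that a file's modification time is a good enough stand-in for its rotation date.

```python
import re
import time
from typing import Optional

# Matches logrotate's numbered rotations, e.g. "app.log.1" or "app.log.2.gz".
_NUMBERED = re.compile(r"^(?P<base>.+)\.(?P<n>\d+)(?P<ext>\.gz)?$")


def dateext_name(filename: str, mtime: float) -> Optional[str]:
    """Return the dateext-style name for a numbered log file, or None if it isn't one."""
    match = _NUMBERED.match(filename)
    if match is None:
        return None
    # UTC keeps the result deterministic; use time.localtime to follow the server's clock.
    stamp = time.strftime("%Y%m%d", time.gmtime(mtime))
    return f"{match['base']}-{stamp}{match['ext'] or ''}"
```

Feeding it each file's `os.path.getmtime()` and passing the result to `os.rename()` would do the actual renaming. Try it on a copy of the log directory first.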