I’ve seen a couple of automated remediation tools get some coverage lately:
One HN comment stood out (from bigdubs):
I don’t like being the voice of the purist in this, but this seems like a bandaid on a bullet wound.
For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.
The system’s operators have to keep the system up and running. If the bug can’t be diagnosed in a sane amount of time but there’s a clear remediation, the right choice is to get the system up and running.
A lot of operations teams employ this method: If Time to Diagnose > Time to Resolve Then Workaround. This keeps the site up and running, but covering a system in bandaids will lead to more bandaids and their complexity will increase.
The commenter has a point – if everything’s band-aided, the system’s behavior becomes unpredictable. Another downside of automated remediation is that it can widen the divide between developers and operators. With sufficient automated remediation, operators often don’t have to involve developers in fixes, which means the underlying bugs may never reach the people who can fix them.
Good reasons to do automated remediation:
- Not all failures can be attributed to and fixed with software (broken NIC, hard drive, CPU fan)
- Most companies do not control their technology stack end to end; doing so would also mean shipping your own device drivers or a modified Linux kernel
- For the technology stack that is under a company’s control, the time to deploy a bugfix is often greater than the time to deploy an understood temporary workaround
There are opportunities for improving operations (and operators’ lives) by employing some sort of automated remediation solution, but it is not a panacea. Some tips for finding a sane balance with automated remediation:
- Associate each workaround with an issue in an issue tracker and count how many times each issue’s workaround is executed – no issue tracker, no automated remediation
- Give workarounds for in-house and vendor software a TTL
- Mark each workaround as diagnosed/undiagnosed, and invest effort in diagnosing the root cause behind the undiagnosed ones
- Make gathering diagnostics the first part of the workaround
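
To make the tips above concrete, here is a minimal sketch of how a workaround could carry that bookkeeping with it. Everything in it is hypothetical: the `Workaround` class, its field names, and the `OPS-123` issue key are made up for illustration and don't come from any particular remediation tool.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Workaround:
    """Hypothetical record tying one automated workaround to one tracked issue."""
    issue_id: str                          # tracker key, e.g. "OPS-123" (made up)
    gather_diagnostics: Callable[[], str]  # runs before anything is touched
    remediate: Callable[[], None]          # the actual band-aid
    expires_at: Optional[float] = None     # Unix time; None means no TTL (e.g. hardware)
    diagnosed: bool = False                # is the root cause understood yet?
    executions: int = 0                    # how often the band-aid was applied
    diagnostics_log: list = field(default_factory=list)

    def run(self, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        if not self.issue_id:
            raise ValueError("no issue tracker entry, no automated remediation")
        if self.expires_at is not None and now >= self.expires_at:
            raise RuntimeError(f"workaround for {self.issue_id} expired; fix or renew it")
        # Diagnostics first, so the evidence survives the remediation.
        self.diagnostics_log.append(self.gather_diagnostics())
        self.remediate()
        self.executions += 1


# Example: a restart workaround that expires in two weeks.
restarts = []
restart_worker = Workaround(
    issue_id="OPS-123",
    gather_diagnostics=lambda: "thread dump placeholder",
    remediate=lambda: restarts.append("worker restarted"),
    expires_at=time.time() + 14 * 24 * 3600,
)
restart_worker.run()
```

The execution count and the `diagnosed` flag are what you would report on: a workaround that fires often and is still undiagnosed is probably the one to spend diagnosis time on first.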
Finally – consider making investments in technology that enables useful diagnostic data to be gathered quickly and non-disruptively. One example of this is OpenStack Nova’s Guru Meditation reports.
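
For a feel of what "quick and non-disruptive" can look like, here is a small sketch using Python's standard-library `faulthandler` module. It is not how Nova implements its reports; it is just an analogous mechanism in which a signal makes a running process dump its thread tracebacks without stopping. It assumes a Unix-like OS (`faulthandler.register` is not available on Windows), and the `request_dump` helper is a made-up name.

```python
import faulthandler
import os
import signal
import tempfile

# Assumption: Unix-like OS. The dump goes to a temp file here so it can be
# read back; a real service would more likely use stderr or a log file.
dump_file = tempfile.TemporaryFile(mode="w+")

# On SIGUSR1, write the traceback of every thread to dump_file.
# The process keeps running afterwards.
faulthandler.register(signal.SIGUSR1, file=dump_file, all_threads=True)


def request_dump(pid: int) -> None:
    """Hypothetical helper: the equivalent of running `kill -USR1 <pid>`."""
    os.kill(pid, signal.SIGUSR1)


# Ask this very process for a dump, then read it back.
request_dump(os.getpid())
dump_file.seek(0)
report = dump_file.read()
print(report)
```

A hook like this makes a natural first step for a workaround: grab the dump, then restart.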