What I liked: Research-backed conclusions about engineering methodologies. The book gives the reader vocabulary, techniques, research, and links to business outcomes. If delivering software feels Sisyphean for you or your team, this book can better equip you to try to ameliorate that situation.
What I didn’t like: The exhaustive research-methodology detail. Yes, it matters, especially when making new claims, but I did not feel it fit the book’s subtitle of “building and scaling high performing technology organizations”.
Overall: There are valid criticisms of the actual research methodologies of the DORA report, which powers a fair amount of the book. Your mileage may vary!
I don’t like being the voice of the purist in this, but this seems like a bandaid on a bullet wound.
For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.
The system’s operators have to keep the system up and running. If a bug can’t be diagnosed in a sane amount of time but there’s a clear remediation, the right choice is to apply the remediation and restore service.
A lot of operations teams employ this method: If Time to Diagnose > Time to Resolve Then Workaround. This keeps the site up and running, but covering a system in band-aids leads to more band-aids, and their combined complexity keeps growing.
The commenter has a point – if everything’s band-aided, the system’s behavior becomes unpredictable. Another downside of automated remediation is that it can widen the divide between developers and operators: with enough automated remediation in place, operators often don’t have to involve developers for fixes at all. This is bad.
Good reasons to do automated remediation:
Not all failures can be attributed to and fixed with software (broken NIC, hard drive, CPU fan)
Most companies do not control their technology stack end to end; controlling it end to end would also mean shipping your own device drivers or a modified Linux kernel
Even for the parts of the stack that are under a company’s control, the time to deploy a bugfix is often greater than the time to deploy an understood temporary workaround
There are opportunities for improving operations (and operators’ lives) by employing some sort of automated remediation solution, but it is not a panacea. Some tips for finding a sane balance with automated remediation (a sketch of what these can look like in practice follows below):
Associate issues from an issue tracker and quantify the number of times each issue’s workaround is executed – no issue tracker, no automated remediation
Give workarounds for in-house and vendor software a TTL
Mark each workaround as diagnosed or undiagnosed, and invest effort in diagnosing the root cause behind the undiagnosed ones
Make gathering diagnostics the first part of the workaround
Finally, consider investing in technology that enables useful diagnostic data to be gathered quickly and non-disruptively. One example of this is OpenStack Nova’s Guru Meditation reports.
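To make the tips concrete, here is a minimal Python sketch of a workaround wrapper that follows them. The issue key, TTL date, diagnostic commands, and the service restart at the end are all illustrative assumptions, not a prescription:

```python
# Sketch of an automated workaround with guardrails: tied to a tracked issue,
# given a TTL, and forced to gather diagnostics before it remediates anything.
import datetime
import pathlib
import subprocess
import syslog

ISSUE = "OPS-1234"                     # hypothetical issue key: no tracked issue, no workaround
EXPIRES = datetime.date(2025, 12, 31)  # TTL: diagnose, renew, or delete the workaround by then


def run_workaround():
    if datetime.date.today() > EXPIRES:
        raise RuntimeError(f"Workaround for {ISSUE} has expired; diagnose or renew it")

    # 1. Gather diagnostics before changing anything.
    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
    diag_dir = pathlib.Path("/var/tmp/diag") / f"{ISSUE}-{stamp}"
    diag_dir.mkdir(parents=True, exist_ok=True)
    for name, cmd in [("dmesg.txt", ["dmesg"]), ("ps.txt", ["ps", "aux"])]:
        (diag_dir / name).write_bytes(subprocess.run(cmd, capture_output=True).stdout)

    # 2. Record the execution so it can be counted against the tracked issue.
    syslog.syslog(f"workaround {ISSUE} executed, diagnostics in {diag_dir}")

    # 3. Apply the understood remediation (a hypothetical service restart).
    subprocess.run(["systemctl", "restart", "myapp"], check=True)


if __name__ == "__main__":
    run_workaround()
```

Counting syslog entries per issue key then gives you the execution numbers to bring back to the issue tracker.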
The Python script was all ready to roll, except that StatsD was only logging one metric. All of the metric packets were arriving at the StatsD instance, but only one was being processed.
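For context, here is a minimal sketch of the wire format a Python script typically uses to push metrics to StatsD over UDP (the metric names, address, and port are made up, and this is not the script in question): each metric is a name:value|type line, and several metrics can share one datagram when they are newline-separated.

```python
# Minimal illustration of the StatsD wire format.
import socket

STATSD_ADDR = ("127.0.0.1", 8125)  # assumption: local StatsD on the default port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# One metric per datagram: name:value|type
sock.sendto(b"app.requests:1|c", STATSD_ADDR)

# Several metrics in a single datagram, separated by newlines.
batch = b"\n".join([
    b"app.requests:1|c",         # counter
    b"app.response_time:42|ms",  # timer
])
sock.sendto(batch, STATSD_ADDR)
```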
Q: What happened <insert very very long time ago> on this service?
A: We can’t keep logs on the server past 2 months. Those logs are gone.
Just about every IaaS out there has an object store. Amazon offers S3 and OpenStack providers have Swift. Why not just point logrotate at one of those object stores?
That’s just what I’ve done with Swiftrotate. It’s a simple shell script to use with logrotate. Config samples and more are in the project’s README.
NOTE: It doesn’t make a lot of sense to use it without dateext enabled in logrotate. A lot of setups don’t use dateext, so there’s a utility script to rename all of your existing files to the dateext format.
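For a sense of the overall shape, here is a hedged Python sketch of the same idea using python-swiftclient: upload dateext-named rotated logs to a Swift container. The auth settings, container name, and glob pattern are assumptions, and the real Swiftrotate is a shell script, so see its README for actual config samples.

```python
# Upload rotated, dateext-named log files to an OpenStack Swift container.
# Intended to be run after logrotate has finished rotating.
import glob
import os
import sys

from swiftclient.client import Connection  # pip install python-swiftclient

AUTH_URL = "https://swift.example.com/auth/v1.0"  # assumption
USER = "account:loguploader"                      # assumption
KEY = "secret"                                    # assumption
CONTAINER = "rotated-logs"                        # assumption


def upload_rotated(log_dir):
    conn = Connection(authurl=AUTH_URL, user=USER, key=KEY)
    conn.put_container(CONTAINER)  # harmless if the container already exists
    # dateext gives rotated files stable names like app.log-20240101(.gz),
    # so re-running overwrites the same objects instead of chasing the
    # shifting .1/.2 suffixes of non-dateext setups.
    for path in glob.glob(os.path.join(log_dir, "*.log-*")):
        with open(path, "rb") as f:
            conn.put_object(CONTAINER, os.path.basename(path), contents=f)


if __name__ == "__main__":
    upload_rotated(sys.argv[1] if len(sys.argv) > 1 else "/var/log/myapp")
```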