Posts by andyhillky


Book Review: Mastering Ansible

While debugging an Ansible playbook, I tweeted a question about variable precedence.

Through that tweet I learned how to spell "mnemonic" AND I got a recommendation to buy Mastering Ansible. This blog post is a quick review of the book.

Overall, I enjoyed the book’s format more than anything else. The flow of objective/code/screenshot really made the concepts gel (using herp/derp in examples is a close second!). I’m an advanced Ansible user (over 3 years), and this book still showed me new parts of Ansible. The book had very useful suggestions for playbook optimization and simplification. Mastering Ansible also provided some nifty tips and tricks like Jinja Macros, ansible-vault with a script, and merging hashes that I didn’t know were possible.

The book did answer my variable precedence question, and it also provided some clarity around using variables with includes, which was what I really needed.
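
As a quick illustration (my own minimal sketch, not an excerpt from the book; the file name and variable are made up), passing a variable into an included task file looks something like this:

- name: run the deploy tasks with an include-scoped variable
  include: deploy_tasks.yml
  vars:
    app_version: "1.2.3"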

If you’re just getting your feet wet with Ansible or you’ve been using it for years, Mastering Ansible is worth picking up.

F5 Rolling Deployments with Ansible

Rolling deployments are well covered in the Ansible docs. This post is about using those rolling deployment patterns with F5 load balancers. These techniques use the F5 modules, which require BIG-IP software version >= 11, and have been tested with Ansible 1.9.4.

There are a couple of things to examine before doing one of these deployments:

  • Preflight Checks
  • Ansible’s pre_tasks / post_tasks
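
For orientation, here's a rough skeleton of where the pieces in this post fit, following the rolling update pattern from the Ansible docs (a sketch only; the webservers group name is a placeholder):

- name: preflight checks
  hosts: localhost
  tasks:
    # collect bigip_facts, determine the primary F5, fail if the pair isn't in sync

- name: rolling deploy
  hosts: webservers
  serial: 1          # take one node out of the pool at a time
  pre_tasks:
    # disable the node on the F5, wait for connections to drain
  tasks:
    # your actual deployment tasks go here
  post_tasks:
    # wait for connections to resume, re-enable the node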

Preflight Checks

If you’re using a pair of F5s for High Availability, consider having a preflight check play to ensure your configurations are in sync and determine your primary F5:


- name: Collect BIG-IP facts
  local_action: >
    bigip_facts
      server={{ item }}
      user={{ f5_user }}
      password={{ f5_password }}
      include=device_group,device
  with_items: f5_hosts

- name: determine primary F5
  set_fact: primary_lb="{{ device[item]['management_address'] }}"
  when: device[item]['failover_state'] == 'HA_STATE_ACTIVE'
  with_items: device.keys()

- name: fail when not in sync
  fail: msg="not proceeding, f5s aren't in sync"
  when: device_group['/Common/device_trust_group']['device']|length > 1 and
        device_group['/Common/device_trust_group']['sync_status']['status'] != 'In Sync'

Recommendation: Use Automatic Sync for doing automated operations against a pair of F5s. I’ve listed the failsafe here for the sake of completeness.

Beware: the bigip_facts module sets top-level facts named after whatever you pass to include, in this case device_group and device.
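
A quick way to see what got set is a plain debug task (illustrative only):

- name: show the device_group facts collected by bigip_facts
  debug: var=device_group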

pre_tasks & post_tasks

The pre_tasks section looks similar to the standard rolling update pattern: disable the node, then wait for connections to drain. There's one new variable introduced (heartbeat_ips); more on that at the end of this post.

pre_tasks:
  - name: disable node
    local_action: >
      bigip_node server={{ primary_lb }}
                 user={{ f5_user }}
                 password={{ f5_password }}
                 name={{ inventory_hostname }}
                 state=disabled

  - name: wait for connections to drain
    wait_for: host={{ inventory_hostname }}
              state=drained
              exclude_hosts={{ ','.join(heartbeat_ips) }}
              port=443

The post_tasks section isn’t much different. No new variables to worry about, just resume connections and re-enable the node.

post_tasks:
  - name: wait for connections to resume
    wait_for: host={{ inventory_hostname }}
              state=started
              port=443

  - name: enable node
    local_action: >
      bigip_node server={{ primary_lb }}
                 user={{ f5_user }}
                 password={{ f5_password }}
                 name={{ inventory_hostname }}
                 state=enabled

A note about heartbeat_ips:

These are the source IP addresses the F5 uses to health check your node. They can be found on each F5 by navigating to Network -> Self IPs. If you're not sure which IPs they are, disable your node in the F5 and run something like tcpdump -i eth0 port 443 on it; the traffic that keeps arriving is the health checks. The heartbeat IPs can also be discovered by adding "self_ip" to the bigip_facts module's include parameter, where they usually show up like this:

hostvars['localhost']['self_ip']['/Common/Self-IPv4']['address']

There are two downsides to discovering heartbeat IPs this way:

1) They have to be gathered from each F5 and concatenated together. The way the bigip_facts module operates, this requires registering the results, and it gets messy:

exclude_hosts={{ f5_facts.results|join(',', attribute='ansible_facts.self_ip./Common/internal-self.address') }}

2) The bigip_facts gathering module is pretty time consuming; I've seen it take up to 30s per F5, depending on the amount of data returned. Storing these IPs elsewhere is much faster =)
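
Storing them elsewhere can be as simple as a group_vars entry, something like this (addresses are made up):

heartbeat_ips:
  - 10.0.0.5
  - 10.0.0.6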

Giving FlowBAT a try

FlowBAT is a graphical flow-based analysis tool.

FlowBAT is a very snappy UI that wraps the SiLK flow collection software. My interest in FlowBAT and SiLK is in using Open vSwitch to push sFlow data to SiLK, then using FlowBAT to quickly query that data. Hopefully I'll get around to posting about OVS, sFlow, SiLK and FlowBAT.

While I was getting started with FlowBAT (and SiLK), I saw there weren't any Ansible roles for either component. The FlowBAT team maintains some shell scripts to install and configure FlowBAT, and the SiLK project has SiLK on a box. I ported both of these projects to Ansible roles and posted them to Ansible Galaxy.

To give them both a try in a Vagrant environment, check out vagrant-flowbat.

Remediation as a Service

I've seen a couple of automated remediation tools get some coverage lately, and both have received interesting threads on Hacker News.

One HN comment that stood out (from bigdubs):

I don’t like being the voice of the purist in this, but this seems like a bandaid on a bullet wound.

For most of the cases where this would seem to be useful there is probably a failure upstream of that usage that should really be fixed.

The system's operators have to keep the system up and running. If the bug can't be diagnosed in a sane amount of time but there's a clear remediation, the right choice is to apply it and restore service.

A lot of operations teams employ this method: if Time to Diagnose > Time to Resolve, then work around. This keeps the site up and running, but covering a system in band-aids leads to more band-aids, and their combined complexity only grows.

The commenter has a point: if everything's band-aided, the system's behavior becomes unpredictable. Another downside of automated remediation is that it can deepen the divide between developers and operators. With sufficient automated remediation, operators often don't have to involve developers for fixes. This is bad.

Good reasons to do automated remediation:

  • Not all failures can be attributed to and fixed with software (broken NIC, hard drive, CPU fan)
  • Most companies do not control their technology stack end to end; full control would mean shipping your own device drivers or a modified Linux kernel
  • Even for the parts of the stack that are under a company's control, the time to deploy a bugfix is often greater than the time to deploy an understood temporary workaround

There are opportunities to improve operations (and operators' lives) by employing some sort of automated remediation solution, but it is not a panacea. Some tips for finding a sane balance with automated remediation (a rough sketch of how these could fit together follows the list):

  • Associate issues from an issue tracker and quantify the number of times each issue’s workaround is executed – no issue tracker, no automated remediation
  • Give the in-house and vendor software related workarounds a TTL
  • Mark the workarounds as diagnosed/undiagnosed, and invest effort in diagnosing the root cause of the workaround
  • Make gathering diagnostics the first part of the workaround
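
Purely as an illustration of how these tips could hang together (not any particular tool's format; every field name here is made up), a workaround entry might track something like this:

- issue: OPS-1234            # linked issue tracker ticket: no ticket, no automation
  diagnosed: false           # root cause not yet understood
  expires: 2016-03-01        # TTL: re-justify or retire the workaround after this date
  executions: 17             # how many times this workaround has fired
  steps:
    - gather_diagnostics     # diagnostics are always the first step
    - restart_service        # the actual workaround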

Finally, consider making investments in technology that enables useful diagnostic data to be gathered quickly and non-disruptively. One example of this is OpenStack Nova's Guru Meditation reports.

Troubleshooting LLDP

Why

LLDP is a wonderful protocol that paints a picture of datacenter topology. lldpd is a daemon you run on your servers to receive LLDP frames and report network location and more. There's also a recently patched lldp Ansible module.
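
Using the module looks roughly like this (a sketch; it reads lldpctl output, so lldpd needs to be installed and receiving frames on the target hosts):

- name: gather LLDP neighbor facts via lldpctl
  lldp:

- name: show what each interface is plugged into
  debug: var=lldp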

Like all tools, LLDP/lldpd has had some issues. Here are the ones I've seen in practice, with diagnosis and resolution:

Switch isn’t configured to send LLDP frames

Diagnosing:

tcpdump -i eth0 -s 1500 -XX -c 1 'ether proto 0x88cc'

Switches with LLDP enabled send frames every 30 seconds by default, so if the capture above stays empty, the switchport's configuration needs LLDP turned on.

Host isn’t reporting LLDP frames

Generally, this means lldpd isn't running on the server: the LLDP frames are arriving (per the tcpdump above), but lldpctl returns nothing.

Diagnosing:

lldpctl # returns nothing
pgrep -f lldpd # returns nothing
service lldpd restart

Be sure that the lldpd service is set to run at boot and take a look at configuration options.
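
In Ansible terms (1.9-era key=value syntax), that boils down to something like:

- name: make sure lldpd is running and starts at boot
  service: name=lldpd state=started enabled=yes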

NIC is dropping LLDP frames

By far the most frustrating: NIC firmware issues can cause the NIC to drop LLDP frames (Page 10, item 14).

The way this one manifests:

  • lldpctl reports nothing
  • lldpd is running
  • switch is configured to send LLDP frames

Diagnosing:

Run a packet capture on the switch to ensure that the LLDP frames are being sent to the port. If you’re able to see the frame go out on the wire and traffic is otherwise functioning normally to the host, the problem lies with the NIC.

The fix here was to apply the NIC firmware upgrade; after that, LLDP was good to go!