Ansible versus Ubiquiti EdgeSwitches

Why use Ansible to manage EdgeSwitches?

Ubiquiti’s EdgeSwitch line is a series of semi-affordable managed switches designed for providers. In hardware, they are identical to Ubiquiti’s UniFi switches (just in a black chassis instead of white) and simply run EdgeSwitch OS. Where UniFi is Software Defined Networking (SDN) made easy, assuming you are running a USG or UniFi APs in addition to the UniFi switches, EdgeSwitches are more of a traditional Layer 2 switch (with some Layer 3 features). Ubiquiti allows you to manage some functions of EdgeSwitches in its UNMS (now called UISP) cloud platform, but if you are like me, you may be used to “air-gapped” networks (networks that aren’t connected to the internet) or may not like the idea of having another cloud service to manage (Ubiquiti’s major data breach of 2020 aside).

Ansible is an open source automation platform from Red Hat that can be run on just about any major Linux distribution, in addition to FreeBSD and macOS. For example, I currently run Ansible on a CentOS 7 VM at work and an Ubuntu 18.04 server at home. Ansible can be used to manage Windows machines (through PowerShell commands), Linux machines and many network devices (access points, switches, routers, firewalls, etc.). Ansible also features the concept of idempotency, or the ability to recognize that a state doesn’t need to be changed. This is handy, as it allows tasks to be skipped if the device already meets the requirements. In the case of most network equipment, however, these checks have to be written into the play.

Basic Principles of Ansible

In its most basic form, Ansible uses SSH to log into devices and run commands. These commands can be run ad hoc using specific modules, or as a series of tasks in a pre-planned playbook. Playbooks are human-readable lists of tasks in YAML format. The playbook also defines the inventory hosts that the play will run against and other connection details. For this post, Ansible plays will be run against an inventory group called “edgeswitch”, which has various EdgeSwitches defined. In my case, all of the EdgeSwitches that I am upgrading are running 1.8.x firmware releases (and I’m upgrading to 1.8.5-lite). In my inventory file, I define variables that set ansible_connection: network_cli and ansible_network_os: edgeswitch, so Ansible knows that it is connecting to a network device via its CLI and that it should use the edgeswitch modules.
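
A minimal sketch of such an inventory, written in Ansible's YAML inventory format. Only the group name and the two connection variables come from the description above. The host names, addresses, and ansible_user value are hypothetical placeholders, and an INI-style inventory would work just as well.

```yaml
# Hypothetical inventory file; host names and addresses are placeholders.
all:
  children:
    edgeswitch:                      # group the play targets with "hosts: edgeswitch"
      hosts:
        es-closet-1:
          ansible_host: 192.168.255.11
        es-closet-2:
          ansible_host: 192.168.255.12
      vars:
        ansible_connection: network_cli   # talk to the device CLI over SSH
        ansible_network_os: edgeswitch    # select the EdgeSwitch CLI handling
        ansible_user: admin               # assumed login user, adjust as needed
```

Putting the connection variables under the group's vars keeps them in one place instead of repeating them on every host.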

Playbook Breakdown

This file will be named edgeswitch-update.yml.

Gathering Facts

My playbook starts as such…

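---
 - name: Check and Upgrade Edgeswitches
   hosts: edgeswitch
   gather_facts: no   # skip the default Unix fact gathering
   become: yes        # enter enable mode on the switch
   vars:
    upgrade_version: 1.8.5-lite   # firmware release the switches should be running
    ftp_user: admin               # user on the FTP server that holds the image
   vars_prompt:
    - name: ftp_pass
      prompt: "FTP password"
      confirm: yes   # ask twice to catch typos
   tasks:
    - name: Gather Facts
      edgeswitch_facts: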

The first line, ---, marks the start of a YAML document. Next come the name of the play and the hosts it runs against. On the fourth line you may notice gather_facts: no; this keeps Ansible from gathering system-related facts from a Unix shell (which is not available on many network devices). become: yes tells Ansible that it needs to become “root”, which with the edgeswitch module means using the enable command. Next we list two variables, upgrade_version and ftp_user, which define the firmware release we want the switches to be running and the username for the on-network FTP server that we will download the image from. The play will then prompt the user for the FTP server password, asking twice because of confirm. Then we begin our list of tasks by calling the edgeswitch_facts module, which gathers things such as the host name, firmware version, hardware model, serial number, and some basic stats regarding the switch interfaces, and stores that information in memory as variables named ansible_net_serialnum, ansible_net_version, etc., which can be called upon at any time throughout the play.
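
To see what the facts module collected before acting on it, a debug task can be dropped in right after the Gather Facts task. This is only an illustrative sketch and is not part of the playbook described in this post. It assumes the edgeswitch_facts module has already run so that the ansible_net_* variables exist.

```yaml
    # Optional, illustrative task: print a one-line summary per switch.
    - name: Show gathered firmware facts
      debug:
        msg: >-
          {{ inventory_hostname }} ({{ ansible_net_model }},
          serial {{ ansible_net_serialnum }})
          is running {{ ansible_net_version }};
          target is {{ upgrade_version }}
```

Comparing the printed running version against upgrade_version is a quick way to predict which switches the later tasks will skip.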

Save the running config

    - name: Save running config
      cli_command: 
       command: write memory
       prompt: Are you sure you want to save?
       answer: y
      when: ansible_net_version != upgrade_version

This task copies the running config to the startup config (so the switch reboots exactly as it is running) by calling the cli_command module. It issues the command write memory and waits for the CLI to reply with Are you sure you want to save? (the full reply is “Are you sure you want to save? (y/n)”, but Ansible treats the prompt as a regex, and the shorter string is plenty to obtain the match). Ansible then answers with a confirmation of y. This task is only executed when the running version (ansible_net_version) does not equal the upgrade_version, which we defined at the beginning of the play as 1.8.5-lite. If the device is already at 1.8.5-lite, the task will be skipped.

Copy firmware

    - name: Copy active firmware to backup (in case of upload failure)
      cli_command:
       command: copy active backup
      vars:
       ansible_command_timeout: 180
      when: ansible_net_version != upgrade_version
    - name: Copy upgrade version to backup
      cli_command:
       command: copy ftp://{{ ftp_user }}@192.168.255.1/ES-1.8.5-lite.stk backup
       check_all: true
       prompt: 
        - Remote Password
        - Are you sure you want to start?
       answer: 
         - "{{ ftp_pass }}"
         - y
      vars:
       ansible_command_timeout: 420
      when: ansible_net_version != upgrade_version

This is actually two tasks. The first task uses the cli_command module to copy the active firmware (i.e. what the switch is currently running) to the backup firmware location. Its timeout is raised to 180 seconds (versus the default 30 seconds) because Ubiquiti’s switches lock out all management functions while the copy command runs, which also means Ansible will not receive the command prompt until the copy finishes (it takes about 90 seconds). If the timeout occurs before the copy completes, the task will fail and Ansible will not continue running tasks against that host.

The second task again uses the cli_command module to “copy” (i.e. download) the new firmware from the FTP server. The ftp_user variable defined at the beginning of the play and the prompted ftp_pass are inserted where needed, with check_all being used so that every listed prompt is checked, in order, and answered. The timeout is also set much longer (420 seconds, though I’ve seen several plays run against Cisco switches with the timeout set to 600 seconds). Again, both tasks are only run when ansible_net_version does not equal upgrade_version.
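
The download task hard-codes both the FTP server address and the image file name. A possible variant, sketched below, derives both from variables so that only the vars section changes for the next release. Two things here are assumptions: ftp_server is a hypothetical variable that would have to be added to the play's vars, and the image is assumed to always be named ES-<version>.stk, which is only known to hold for the one file name used in this post.

```yaml
    # Sketch only. Assumes "ftp_server: 192.168.255.1" has been added under vars
    # and that firmware images follow the ES-<version>.stk naming pattern.
    - name: Copy upgrade version to backup
      cli_command:
        command: "copy ftp://{{ ftp_user }}@{{ ftp_server }}/ES-{{ upgrade_version }}.stk backup"
        check_all: true            # check every prompt below, in order
        prompt:
          - Remote Password
          - Are you sure you want to start?
        answer:
          - "{{ ftp_pass }}"
          - "y"
      vars:
        ansible_command_timeout: 420
      when: ansible_net_version != upgrade_version
```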

Change Boot Image and Save

    - name: Change Boot System
      cli_command:
       command: boot system backup
      when: ansible_net_version != upgrade_version
    - name: Saving running config to startup config
      cli_command:
       command: write memory
       prompt: Are you sure you want to save?
       answer: y

This simply changes the boot image from the current image to the image we just (hopefully) downloaded by telling the switch to boot off the backup image (once booted, the backup image becomes the active image and the active image becomes the backup image). If the image download fails, the switch discards the downloaded image, leaving whatever was already in the backup slot. Because we change the boot image either way, an ancient image sitting in the backup slot that didn’t support our current configuration would then be booted, and things would break. That is the reason for copying the active image to the backup slot in the first copy task: if the download fails, the switch will at least reboot on a known working image. Again, we only change the boot image when we aren’t running the current firmware. Finally, we save the running config again (this applies to all switches regardless of firmware).

Reboot and Verify Switch Comes Back Online

    - name: Restart Switch
      cli_command:
       command: reload in 00:01
      when: ansible_net_version != upgrade_version
    - name: Wait for device to come back online
      wait_for_connection:
       delay: 65
       sleep: 20
       connect_timeout: 10
      when: ansible_net_version != upgrade_version

Finally, we issue a command to “reload” the switch in 1 minute when the current firmware is not the upgrade firmware. Normally you would simply issue the “reload” command, but as EdgeSwitch OS doesn’t issue any kind of prompt afterwards (like “Rebooting switch”) and doesn’t bother to kill the connection, you’ll get a timeout failure unless you tell it to reboot in a set amount of time. Once that has completed, we wait for the connection to come back up, using the wait_for_connection module, to verify that management on the switch is working. Since we just exited the switch with a little under 1 minute to go until it reboots, we delay the first attempt to re-establish the connection for 65 seconds, then sleep for 20 seconds between retries (for up to 10 minutes, which is the module’s default timeout). We consider a connection attempt to have timed out if we don’t hear anything back within 10 seconds (the default is 5 seconds, which, depending on the network, may not be long enough).
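
The wait above only confirms that the switch is reachable again, not which firmware it booted. One way to close that gap, sketched below as an optional addition rather than part of the playbook described in this post, is to gather facts a second time and assert on the version. It assumes ansible_net_version reports the release in exactly the same form as upgrade_version, which the earlier when conditions already rely on.

```yaml
    # Optional follow-up tasks, appended after "Wait for device to come back online".
    - name: Re-gather facts after the reboot
      edgeswitch_facts:

    - name: Confirm the switch is running the target firmware
      assert:
        that:
          - ansible_net_version == upgrade_version
        fail_msg: "{{ inventory_hostname }} is still on {{ ansible_net_version }}"
        success_msg: "{{ inventory_hostname }} is on {{ upgrade_version }}"
```

A failed assertion marks the host as failed in the play recap, which makes switches that fell back to the old image easy to spot.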

That’s all

Now we can call this play using the command ansible-playbook edgeswitch-update.yml -K, where the -K flag means ask for the become/enable password. As long as your hosts are in a group named edgeswitch and you have the correct login/network variables configured in the inventory file (you can also place them elsewhere in more advanced setups), it will go out and run. If the firmware update fails, the switch will reboot on the firmware you copied over from the running image, so nothing should break. You’ll just take the outage while the switch reboots, and you can simply rerun the play at a later time to clean up the stragglers. This play took lots of trial and error to get working correctly, as there really wasn’t any reference for running Ansible plays against a Ubiquiti EdgeSwitch.
