Nick's OSPF TroubleShooter (nots)

A simple but powerful Ansible playbook to troubleshoot OSPF network problems
on a variety of platforms. It is simple because it does not require
extensive preparatory configuration for individual host state checking.
It is powerful because despite not having the aforementioned level of
granularity, it rapidly discovers the vast majority of OSPF problems.

Contact information:
Email: njrusmc@gmail.com
Twitter: @nickrusso42518

Supported platforms

Today, Cisco IOS/IOS-XE, IOS-XR, and NX-OS are supported. Valid device_type
options used for inventory groups are enumerated below. Each platform
has a folder in the devices/ directory, such as devices/ios/. The
main.yml file in each folder is the task list that the main playbook
includes to begin the device-specific tasks.

ios: Cisco classic IOS and Cisco IOS-XE devices.
iosxr: Cisco IOS-XR devices.
nxos: Cisco NX-OS devices.

Testing was conducted on the following platforms and versions:

Cisco CSR1000v, version 16.07.01a, running in AWS
Cisco CSR1000v, version 16.09.02, running in AWS
Cisco CSR1000v, version 16.12.01a, running in AWS
Cisco IOSv, version 15.6M, running on IOU
Cisco XRv9000, version 6.3.1, running in AWS
Cisco 3172T, version 6.0.2.U6.4a, hardware appliance

Control machine information:

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

$ uname -a
Linux ip-10-125-0-100.ec2.internal 3.10.0-693.el7.x86_64 #1 SMP
  Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

$ ansible --version
ansible 2.8.7
  config file = /home/ec2-user/racc/ansible.cfg
  configured module search path = ['/home/ec2-user/.ansible/plugins/modules',
    '/usr/share/ansible/plugins/modules']
  ansible python module location =
    /home/ec2-user/environments/racc287/lib/python3.7/site-packages/ansible
  executable location = /home/ec2-user/environments/racc287/bin/ansible
  python version = 3.7.3 (default, Aug 27 2019, 16:56:53)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

Summarized test cases

The following tests are run in sequence. Note that the exact items tested
vary between platforms since command outputs and feature sets also vary.
Administrative tasks, such as creating directories and logging on the control
machine, are not detailed here for brevity.

Per device testing

Ansible logs into each OSPF router for the purpose of collecting information
and validating its correctness based on a small amount of pre-identified
state configuration. As discussed in the "variables" section, some of these
tests can be skipped by modifying the appropriate key/value pairs.

The list of tests run on each specific device is enumerated in the
README.md file inside the subdirectory of devices/ corresponding to
each unique device type.

Whole network testing

After individual routers are validated, additional tests based on the
aggregated data from all routers are run. It is possible to run these
tests on a per-host basis, but that would effectively cause the same test
to be run N times rather than one time.

Ensure there are no duplicate OSPF router IDs. While it is technically
possible to duplicate RIDs in different areas (sometimes), there is no
legitimate reason to do it. This playbook always considers
duplicate RIDs to be an error condition.
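
The duplicate RID check can be illustrated with a minimal Python sketch.
This is not the playbook's actual filter code; the function name
find_duplicate_rids and the host-to-RID mapping used as input are
assumptions made for illustration only.

```python
from collections import defaultdict


def find_duplicate_rids(rid_by_host):
    """Return {rid: [hosts]} for every router ID claimed by two or more hosts.

    rid_by_host is a hypothetical mapping of inventory hostname to the
    OSPF router ID collected from that host.
    """
    hosts_by_rid = defaultdict(list)
    for host, rid in rid_by_host.items():
        hosts_by_rid[rid].append(host)
    # Keep only the router IDs that appear on more than one host.
    return {
        rid: sorted(hosts)
        for rid, hosts in hosts_by_rid.items()
        if len(hosts) > 1
    }


# Hypothetical data: csr1 and csr3 were configured with the same RID.
rids = {"csr1": "10.0.0.1", "csr2": "10.0.0.2", "csr3": "10.0.0.1"}
duplicates = find_duplicate_rids(rids)
print(duplicates)
```

Because the input is aggregated from all routers, the check runs once
for the whole network rather than once per host. An empty result means
no duplicates were found.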

Operations

This solution uses a GNU Makefile to simplify setup and daily operations.
The following make targets are supported.

make: Same as make test
make test: Runs all testing (lint, unit, integ) in sequence
make setup: Installs packages and builds the vault password file
make lint: Runs YAML and Python linters along with Python formatter
make unit: Runs function-level testing on Python filters
make integ: Runs the test playbook (integration test)

Variables

The following subsections detail the different types of variables, their
scopes, and their purposes within the playbook.

Process-level

This playbook assumes that all OSPF routers are in a single process, and
if they are not, only a single process can be checked at a time.

Process-level variables differ between device types. For a list of supported
process variables, reference the individual README.md files in each
devices/ subdirectory correlated to each device type.

Area-level

This playbook allows an unlimited number of areas to be specified, each with
their own area-specific configuration. The playbook assumes that there are
no duplicate areas in the network. For example, while it is possible to have
two disparate area 1 sections of the network tied into area 0, this playbook
does not support it.

The top-level key is the area ID, specified as a string in the format
"area#" where # is the ID itself. For example: "area0" and "area51"

type: The area type, specified as a string from the following options:
"standard", "nssa", "stub". No other options are allowed, and
area 0 must be type "standard". This key is mandatory.
routers: The number of routers expected to exist in a given area,
expressed as a positive integer. To disable this check, exclude this key.
drs: The number of designated routers expected to exist in a given area,
expressed as a positive integer. Note that DRs exist on broadcast-style
network segments only, and are unnecessary on point-to-point links over
broadcast media such as ethernet. To disable this check, exclude this key.
has_frr: Boolean representing whether OSPF Fast Re-Route
(also known as Loop Free Alternative/LFA) should be enabled
(true) or disabled (false) for this process. This test checks for the
basic enablement (or not) of this feature and not advanced derivatives, such
as remote LFA (rLFA) or topology-independent LFA (TI-LFA).
To disable this check, exclude this key.
max_lsa3: The maximum number of summary LSAs (LSA type-3) that should be
present within an area. This inclusive upper bound enforces a limit on
the number of LSA3 for the purpose of flood reduction and memory
consumption. It can also enforce specific architectural designs. For
example, a totally stubby area with one ABR has only one LSA3 for the
default route, and this option can enforce this. This key is processed
for any area type. To disable this check, exclude this key.
max_lsa7: The maximum number of NSSA-external LSAs (LSA type-7) that
should be present within an area. This inclusive upper bound enforces a
limit on the number of LSA7 for the purpose of flood reduction and memory
consumption. It can also enforce specific architectural designs. For
example, an extranet NSSA acting as a non-transit buffer might be receiving
a small number of routes from a peer, which can be enforced. This key is
only processed when the area type is "nssa".
To disable this check, exclude this key.
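
The rules above for area keys, area types, and the optional count keys
can be sketched as a small Python validator. This is an illustration
only, not code from the playbook; the function name validate_areas and
the error strings are hypothetical, while the key names and allowed
values come from the list above.

```python
import re

VALID_TYPES = {"standard", "nssa", "stub"}
COUNT_KEYS = ("routers", "drs")


def validate_areas(areas):
    """Return a list of error strings; an empty list means the input is valid."""
    errors = []
    for name, area in areas.items():
        # The top-level key must be "area" followed by the numeric area ID.
        match = re.fullmatch(r"area(\d+)", name)
        if not match:
            errors.append(f"{name}: key must look like 'area0' or 'area51'")
            continue
        area_type = area.get("type")
        if area_type not in VALID_TYPES:
            errors.append(f"{name}: type must be one of {sorted(VALID_TYPES)}")
        elif int(match.group(1)) == 0 and area_type != "standard":
            errors.append(f"{name}: area 0 must be type 'standard'")
        # Optional keys are checked only when present; omitting a key
        # disables the corresponding test.
        for key in COUNT_KEYS:
            if key in area:
                value = area[key]
                if type(value) is not int or value <= 0:
                    errors.append(f"{name}: {key} must be a positive integer")
    return errors


# Hypothetical area definitions using the keys described above.
example = {
    "area0": {"type": "standard", "routers": 4, "drs": 1, "has_frr": False},
    "area51": {"type": "nssa", "routers": 2, "max_lsa3": 1, "max_lsa7": 10},
}
print(validate_areas(example))
```

Returning a list of messages instead of raising on the first problem
lets a user see every mistake in the variables file in a single pass.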

Device group level

Each device type (ios, iosxr, etc.) has its own group_vars/ file which
contains OS-specific parameters. These should never be changed by consumers
as their main purpose is abstraction, not user input.

ansible_network_os: A string representing the device OS name. These were
enumerated in the "Supported Platforms" section earlier in the document.
commands: A list of strings representing the CLI commands to be
issued to the device. These collect information from the devices relevant
to troubleshooting OSPF.

One special consideration is extended to ios. Because classic IOS and
IOS-XE have minor differences in the commands they support, they use
different command lists. The IOS-related group variables are as follows:

ios: General IOS parameters, applying to classic and XE variants
iosclassic: Command list specific to classic IOS devices
iosxe: Command list specific to IOS-XE devices

An example inventory might look like this:

all:
  children:
    ospf_routers:
      children:
        ios:
          children:
            iosxe:
              hosts:
                CSR1000:
                ISR4451:
            iosclassic:
              hosts:
                C3945E:
                C3750X:

Note that some extra commands are appended to the end of the commands list
which are used for collection only. The output from these commands is written
to the host log which can assist with troubleshooting, but it is not parsed
or checked in any way within the logic of the playbook.

Host level

This playbook aims to minimize the number of host-specific variables as
managing these inventory variables becomes burdensome in large networks.

my_areas: List of integers representing the areas in which a given router
participates. For example, a router only in area 0 would use [0]. A
router in area 0 and 51 would use [0, 51]. Note that the playbook is
smart enough to identify whether a router is an Area Border Router (ABR)
or not based on its area membership. The empty list [] is not valid
since all OSPF routers must belong to at least one area.
This key is mandatory.
my_nbr_count: Number of neighbors a specific router is expected to have.
This is the grand total of all OSPF neighbors in a given process and is
not checked on a per-interface or per-area basis. It is a positive
integer. To disable this check, exclude this key. Disabling this check on
routers with a variable number of neighbors, such as an Internet VPN
concentrator, could be useful.
should_be_asbr: Boolean representing whether a router should be an
Autonomous System Boundary Router (ASBR). A value of true indicates
that a router should be an ASBR (note that this includes NSSA ABRs)
and a value of false indicates that a router should not be an ASBR.
To disable this check, exclude this key.
should_be_stub_rtr: Boolean representing whether a router should be a
stub router with max-metric advertised for all links. This reduces the
likelihood that a router is used for transit. A value of true indicates
that a router should be a stub router (minor options are not evaluated)
and a value of false indicates that a router should not be a stub router.
To disable this check, exclude this key.
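
Two of the host-level behaviors above can be sketched in Python: deriving
ABR status from my_areas, and skipping a check when its key is excluded.
This is a hedged illustration rather than the playbook's real logic. The
function names are hypothetical, and the sketch assumes that membership
in more than one area is what marks a router as an ABR.

```python
def is_abr(my_areas):
    """Return True when the router participates in more than one area."""
    if not my_areas:
        # The empty list is invalid: every OSPF router is in some area.
        raise ValueError("my_areas must contain at least one area")
    return len(set(my_areas)) > 1


def nbr_count_ok(host_vars, actual_count):
    """Compare the actual neighbor count against my_nbr_count.

    The check passes automatically when the key is excluded, which is
    how a test is disabled for a given host.
    """
    expected = host_vars.get("my_nbr_count")
    return expected is None or expected == actual_count


# Hypothetical host variables for two routers.
print(is_abr([0]))                           # backbone-only router
print(is_abr([0, 51]))                       # router in areas 0 and 51
print(nbr_count_ok({"my_nbr_count": 2}, 3))  # count mismatch
print(nbr_count_ok({}, 17))                  # check disabled
```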

Logging

Given the generic nature of the playbook, some tests will fail with generic
error messages. For example, one host may fail because a router had an
incorrect number of actual neighbors, either greater than or less than
the user-configured my_nbr_count expectation. By design, the playbook
lacks granularity to determine which neighbor failed and on which interface.
Logging can be toggled off and on by adjusting the log variable which
can be true or false.

CLI output from all commands is written to a file in the logs/ directory.
A subdirectory for every execution of the playbook is created using
the format nots_<date/time>/ which contains all the individual log files.
The date/time uses ISO8601 short format, such as 20180522T134558. Log files
are not version controlled and are excluded from git automatically. An example
log directory after three playbook runs against an inventory of two hosts
(csr1 and csr2), would yield something like this:

$ tree logs/
logs
├── nots_20180522T192916
│   ├── csr1.txt
│   └── csr2.txt
├── nots_20180522T194610
│   ├── csr1.txt
│   └── csr2.txt
└── nots_20180522T197133
    ├── csr1.txt
    └── csr2.txt
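
For reference, a directory name in the format shown above can be produced
with Python's standard datetime module. This is a minimal sketch; the
function name log_dir_name is hypothetical and the playbook may build the
name differently.

```python
from datetime import datetime


def log_dir_name(when):
    """Build a per-run directory name using ISO8601 short format."""
    return "nots_" + when.strftime("%Y%m%dT%H%M%S")


# A fixed timestamp keeps the example reproducible.
print(log_dir_name(datetime(2018, 5, 22, 19, 29, 16)))  # nots_20180522T192916
```

Because the timestamp sorts lexicographically in chronological order, a
plain directory listing shows the runs from oldest to newest.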

Within each log file, each command's output is wrapped in leading and trailing
comment blocks that show the command issued. These logs are useful for finding
out why the playbook failed without having to manually log into failing hosts.
The example below shows the beginning of an IOS-based platform log file with
many redactions for brevity:

$ cat logs/nots_20180522T194610/csr1.txt
!!!
!!! Start command: show ip ospf 1
!!!
Routing Process "ospf 1" with ID 10.0.0.1
 Start time: 00:02:24.532, Time elapsed: 00:48:30.920
 Supports only single TOS(TOS0) routes
[snip, more output]
!!!
!!! End command:   show ip ospf 1
!!!
!!!
!!! Start command: show ip ospf 1 neighbor
!!!
Neighbor ID     Pri   State           Dead Time   Address         Interface
10.0.0.2          1   FULL/DR         00:00:39    192.168.102.2   Tunnel102
10.0.0.2          0   FULL/  -        00:00:37    192.168.101.2   Tunnel101
!!!
!!! End command:   show ip ospf 1 neighbor
!!!
[snip, more commands]
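
When reviewing a log offline, the start and end markers make it easy to
split a file back into per-command output. The Python sketch below is not
part of the playbook. It assumes the marker layout shown in the example
above, and the function name split_log is hypothetical.

```python
import re

START = re.compile(r"^!!! Start command: (.+)$")
END = re.compile(r"^!!! End command:\s+(.+)$")


def split_log(text):
    """Return {command: output} parsed from one host log file."""
    sections = {}
    current = None
    for line in text.splitlines():
        start = START.match(line)
        if start:
            current = start.group(1).strip()
            sections[current] = []
            continue
        if END.match(line):
            current = None
            continue
        # Bare "!!!" lines are marker padding, not device output.
        # Assumption: real command output never consists of exactly "!!!".
        if current is not None and line != "!!!":
            sections[current].append(line)
    return {cmd: "\n".join(lines) for cmd, lines in sections.items()}


# A shortened sample in the same layout as the log excerpt above.
sample = "\n".join([
    "!!!",
    "!!! Start command: show ip ospf 1",
    "!!!",
    'Routing Process "ospf 1" with ID 10.0.0.1',
    "!!!",
    "!!! End command:   show ip ospf 1",
    "!!!",
])
print(split_log(sample))
```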

FAQ

Q: Most code across IOS, IOS-XR, and NX-OS is the same. Why not combine it?
A: The goal is to support more platforms in the future such as Cisco
ASA-OS, and possibly non-Cisco devices. These devices will likely return
different sets of information. This tool is designed to be simple,
not particularly advanced through layered abstractions.

Q: Why not use an API like RESTCONF or NETCONF instead of SSH + CLI?
A: This tool is designed for risk-averse users or managers that are not
rapidly migrating to API-based management. It is not an infrastructure-as-code
solution and does not manage device configurations. All of the commands used
in the playbook can be issued at privilege level 1 to further reduce risk.
With the exception of updating the login credentials and populating the
necessary variables, there is no complex setup work required.

Q: Why not parse the OSPF interfaces? Many errors occur at this level.
A: Parsing individual interfaces would require state declarations on a
per-host basis to determine what each interface should have. This defeats
the purpose of a simple, low-effort solution which uses only area and process
level parameters for verification. Furthermore, the detailed statistics
checking will alert the user to many errors (authentication, MTU mismatch, etc.)
at a more general level. The user can check the logs to see the exact command
output, which includes the non-parsed interface text.

Q: For NX-OS, why didn't you use the | json filter from the CLI?
A: While this would have saved a lot of parsing code, I did not want to
have an inconsistent overall strategy for one network device. Additionally,
the filter does not render milliseconds properly (e.g., SPF throttle timers)
which reduced my confidence in its overall accuracy.

Use Case

Ansible to troubleshoot OSPF issues

This simple but powerful Ansible playbook can be used to troubleshoot OSPF network problems on a variety of platforms. It does not require extensive preparatory configuration for individual host state checking and rapidly discovers the vast majority of OSPF problems.

Objective

Supports Cisco IOS/IOS-XE, IOS-XR, and NX-OS platforms.
Used to troubleshoot OSPF network problems on a variety of platforms.
Rapidly discovers the vast majority of OSPF problems and does not require extensive preparatory configuration for individual host state checking.
Runs all testing in sequence: lint, unit, and the integration playbook.

Requirements

To use this code, you will need a Linux VM set up as an Ansible client for testing the use case.

Instructions

Ansible is required; version 2.6.2 was used when developing this repository.
Create a Python 3 virtual environment.
Activate the virtual environment.
Install the requirements:
pip install -r requirements.txt
Run "make setup" to create the vault password file.
Before running the playbook, run "make lint" and "make unit".
Run "make integ".

Learning Labs

Introduction to Ansible for IOS XE Configuration Management