A simple but powerful Ansible playbook to troubleshoot OSPF network problems
on a variety of platforms. It is simple because it does not require
extensive preparatory configuration for individual host state checking.
It is powerful because despite not having the aforementioned level of
granularity, it rapidly discovers the vast majority of OSPF problems.
Contact information:
Email: njrusmc@gmail.com
Twitter: @nickrusso42518
Today, Cisco IOS/IOS-XE, IOS-XR, and NX-OS are supported. Valid `device_type`
options used for inventory groups are enumerated below. Each platform
has a folder in the `devices/` directory, such as `devices/ios/`. The file
named `main.yml` is the task list that is included from the main playbook
and begins the device-specific tasks.

- `ios`: Cisco classic IOS and Cisco IOS-XE devices.
- `iosxr`: Cisco IOS-XR devices.
- `nxos`: Cisco NX-OS devices.

Testing was conducted on the following platforms and versions:
Control machine information:
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
$ uname -a
Linux ip-10-125-0-100.ec2.internal 3.10.0-693.el7.x86_64 #1 SMP
Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
$ ansible --version
ansible 2.8.7
config file = /home/ec2-user/racc/ansible.cfg
configured module search path = ['/home/ec2-user/.ansible/plugins/modules',
'/usr/share/ansible/plugins/modules']
ansible python module location =
/home/ec2-user/environments/racc287/lib/python3.7/site-packages/ansible
executable location = /home/ec2-user/environments/racc287/bin/ansible
python version = 3.7.3 (default, Aug 27 2019, 16:56:53)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]
The following tests are run in sequence. Note that the exact items tested
vary between platforms since command outputs and feature sets also vary.
Administrative tasks, such as creating directories and logging on the control
machine, are not detailed here for brevity.
Ansible logs into each OSPF router for the purpose of collecting information
and validating its correctness based on a small amount of pre-identified
state configuration. As discussed in the "variables" section, some of these
tests can be skipped by modifying the appropriate key/value pairs.
The tests run on each specific device type are enumerated in the `README.md`
file inside the corresponding subdirectory of `devices/`.
After individual routers are validated, additional tests based on the
aggregated data from all routers are run. It is possible to run these
tests on a per-host basis, but that would effectively cause the same test
to be run N times rather than one time.
This solution uses a GNU `Makefile` to simplify setup and daily operations.
The following `make` targets are supported.

- `make`: Same as `make test`
- `make test`: Runs all testing (lint, unit, integ) in sequence
- `make setup`: Installs packages and builds the vault password file
- `make lint`: Runs YAML and Python linters along with Python formatter
- `make unit`: Runs function-level testing on Python filters
- `make integ`: Runs the test playbook (integration test)

The following subsections detail the different types of variables, their
scopes, and their purposes within the playbook.
This playbook assumes that all OSPF routers are in a single process, and
if they are not, only a single process can be checked at a time.
Process-level variables differ between device types. For a list of supported
process variables, reference the individual `README.md` files in each
`devices/` subdirectory correlated to each device type.
This playbook allows an unlimited number of areas to be specified, each with
their own area-specific configuration. The playbook assumes that there are
no duplicate areas in the network. For example, while it is possible to have
two disparate area 1 sections of the network tied into area 0, this playbook
does not support it.
The top-level key is the area ID, specified as a string in the format
`"area#"` where # is the ID itself. For example: `"area0"` and `"area51"`.
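The key format above can be illustrated with a minimal Python sketch. The helper names `area_key_to_id` and `area_id_to_key` are hypothetical and are not part of the playbook; they only show how an `"area#"` string relates to the integer ID it encodes.

```python
import re

# Matches keys of the form "area<digits>", such as "area0" or "area51".
AREA_KEY = re.compile(r"area(\d+)")


def area_key_to_id(key):
    """Return the integer area ID encoded in a key such as "area51"."""
    match = AREA_KEY.fullmatch(key)
    if match is None:
        raise ValueError(f"invalid area key: {key!r}")
    return int(match.group(1))


def area_id_to_key(area_id):
    """Return the area key string for an integer area ID."""
    return f"area{area_id}"
```

For example, `area_key_to_id("area51")` yields the integer 51, while a key such as `"area"` or `"51"` raises `ValueError`.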

- `type`: The area type, specified as a string from the following options:
  `"standard"`, `"nssa"`, `"stub"`. No other options are allowed, and
  `"standard"` [...]. This key is mandatory.
- `routers`: The number of routers expected to exist in a given area, [...]
- `drs`: The number of designated routers expected to exist in a given area, [...]
- `has_frr`: Boolean representing whether OSPF Fast Re-Route is enabled
  (`true`) or disabled (`false`) for this process. This test checks for the [...]
- `max_lsa3`: The maximum number of summary LSAs (LSA type-3) that should be [...]
- `max_lsa7`: The maximum number of NSSA-external LSAs (LSA type-7) that [...]

Each device type (`ios`, `iosxr`, etc.) has its own `group_vars/` file which
contains OS-specific parameters. These should never be changed by consumers
as their main purpose is abstraction, not user input.

- `ansible_network_os`: A string representing the device OS name. These were [...]
- `commands`: A list of strings representing the CLI commands to be [...]

One special consideration is extended to `ios`. Because classic IOS and
IOS-XE have minor differences in the commands they support, they use
different command lists. The IOS-related group variables are as follows:

- `ios`: General IOS parameters, applying to classic and XE variants
- `iosclassic`: Command list specific to classic IOS devices
- `iosxe`: Command list specific to IOS-XE devices

An example inventory might look like this:
all:
  children:
    ospf_routers:
      children:
        ios:
          children:
            iosxe:
              hosts:
                CSR1000:
                ISR4451:
            iosclassic:
              hosts:
                C3945E:
                C3750X:
Note that some extra commands are appended to the end of the `commands` list
which are used for collection only. The output from these commands is written
to the host log which can assist with troubleshooting, but it is not parsed
or checked in any way within the logic of the playbook.
This playbook aims to minimize the number of host-specific variables as
managing these inventory variables becomes burdensome in large networks.

- `my_areas`: List of integers representing the areas in which a given router
  [...] `[0]`. A [...] `[0, 51]`. Note that the playbook is [...] `[]` is not
  valid [...]
- `my_nbr_count`: Number of neighbors a specific router is expected to have.
- `should_be_asbr`: Boolean representing whether a router should be an ASBR.
  `true` indicates that a router should be an ASBR and `false` indicates that
  a router should not be an ASBR.
- `should_be_stub_rtr`: Boolean representing whether a router should be a
  stub router. `true` indicates that a router should be a stub router and
  `false` indicates that a router should not be a stub router.

Given the generic nature of the playbook, some tests will fail with generic
error messages. For example, one host may fail because a router had an
incorrect number of actual neighbors, either greater than or less than
the user-configured `my_nbr_count` expectation. By design, the playbook
lacks granularity to determine which neighbor failed and on which interface.
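The neighbor-count example can be sketched in Python as follows. This is only an illustration of the kind of coarse check described, not the playbook's actual logic; the function name `check_nbr_count` and the wording of the message are assumptions.

```python
def check_nbr_count(hostname, actual, expected):
    """Compare a router's neighbor count against its my_nbr_count value.

    Returns None when the counts match, otherwise a generic error string.
    The message cannot say which neighbor or interface is at fault because
    only the total count is known.
    """
    if actual == expected:
        return None
    return f"{hostname}: expected {expected} OSPF neighbors, found {actual}"
```

A mismatch in either direction produces the same style of message, which is why the host log remains the place to look for the specific neighbor or interface involved.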
Logging can be toggled off and on by adjusting the `log` variable, which
can be `true` or `false`.
CLI output from all commands is written to a file in the `logs/` directory.
A subdirectory for every execution of the playbook is created using
the format `nots_<date/time>/` which contains all the individual log files.
The date/time uses ISO8601 short format, such as `20180522T134558`. Log files
are not version controlled and are excluded from git automatically. An example
log directory after three playbook runs against an inventory of two hosts
(csr1 and csr2) would yield something like this:
$ tree logs/
logs
├── nots_20180522T192916
│   ├── csr1.txt
│   └── csr2.txt
├── nots_20180522T194610
│   ├── csr1.txt
│   └── csr2.txt
└── nots_20180522T197133
    ├── csr1.txt
    └── csr2.txt
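The directory layout above can be reproduced with a short Python sketch. The function names `run_log_dir` and `host_log_path` are hypothetical; the sketch assumes only the `logs/` base directory, the `nots_` prefix, the short ISO8601 timestamp, and the per-host `.txt` file names shown in the example.

```python
from datetime import datetime
from pathlib import Path


def run_log_dir(base="logs", now=None):
    """Return the per-run log directory, e.g. logs/nots_20180522T192916."""
    if now is None:
        now = datetime.now()
    # ISO8601 short (basic) format: YYYYMMDDTHHMMSS
    return Path(base) / f"nots_{now.strftime('%Y%m%dT%H%M%S')}"


def host_log_path(run_dir, hostname):
    """Return the log file path for one host inside a run directory."""
    return run_dir / f"{hostname}.txt"
```

Because the timestamp has one-second resolution, two runs started within the same second would map to the same directory in this sketch.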
Within each log file, the output of every command is wrapped in heading and
trailing comment blocks that name the command issued. These logs are useful
for finding out why the playbook failed without having to manually log into
failing hosts.
The example below shows the beginning of an IOS-based platform log file with
many redactions for brevity:
$ cat logs/nots_20180522T194610/csr1.txt
!!!
!!! Start command: show ip ospf 1
!!!
Routing Process "ospf 1" with ID 10.0.0.1
Start time: 00:02:24.532, Time elapsed: 00:48:30.920
Supports only single TOS(TOS0) routes
[snip, more output]
!!!
!!! End command: show ip ospf 1
!!!
!!!
!!! Start command: show ip ospf 1 neighbor
!!!
Neighbor ID     Pri   State           Dead Time   Address         Interface
10.0.0.2          1   FULL/DR         00:00:39    192.168.102.2   Tunnel102
10.0.0.2          0   FULL/  -        00:00:37    192.168.101.2   Tunnel101
!!!
!!! End command: show ip ospf 1 neighbor
!!!
[snip, more commands]
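Because each command's output sits between `Start command` and `End command` markers, a log file can be split back into per-command sections for offline review. The following is a minimal sketch, assuming the marker format shown in the example above; the function name `split_log` is hypothetical and is not part of the playbook.

```python
import re

# Marker lines as they appear in the example log above.
START = re.compile(r"^!!! Start command: (.+)$")
END = re.compile(r"^!!! End command: (.+)$")


def split_log(text):
    """Map each logged command to the list of its output lines."""
    outputs = {}
    current = None
    for line in text.splitlines():
        start = START.match(line)
        if start:
            # A new command section begins.
            current = start.group(1).strip()
            outputs[current] = []
        elif END.match(line):
            # The current command section is finished.
            current = None
        elif current is not None and line.strip() != "!!!":
            # Keep output lines; bare "!!!" lines are decoration only.
            outputs[current].append(line)
    return outputs
```

Applied to the example log, the result has one key per command, such as `show ip ospf 1 neighbor`, whose value is the list of output lines captured for it. An output line consisting solely of `!!!` would be dropped by this sketch.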
Q: Most code across IOS, IOS-XR, and NX-OS is the same. Why not combine it?
A: The goal is to support more platforms in the future such as Cisco
ASA-OS, and possibly non-Cisco devices. These devices will likely return
different sets of information. This tool is designed to be simple,
not particularly advanced through layered abstractions.
Q: Why not use an API like RESTCONF or NETCONF instead of SSH + CLI?
A: This tool is designed for risk-averse users or managers that are not
rapidly migrating to API-based management. It is not an infrastructure-as-code
solution and does not manage device configurations. All of the commands used
in the playbook can be issued at privilege level 1 to further reduce risk.
With the exception of updating the login credentials and populating the
necessary variables, there is no complex setup work required.
Q: Why not parse the OSPF interfaces? Many errors occur at this level.
A: Parsing individual interfaces would require state declarations on a
per-host basis to determine what each interface should have. This defeats
the purpose of a simple, low-effort solution which uses only area and process
level parameters for verification. Furthermore, the detailed statistics
checking will alert the user to many errors (authentication, MTU mismatch, etc.)
at a more general level. The user can check the logs to see the exact commands
and their output, which includes the non-parsed interface text.
Q: For NX-OS, why didn't you use the `| json` filter from the CLI?
A: While this would have saved a lot of parsing code, I did not want to
have an inconsistent overall strategy for one network device. Additionally,
the filter does not render milliseconds properly (e.g., SPF throttle timers)
which reduced my confidence in its overall accuracy.
To use this code, you will need to set up a Linux VM as an Ansible client for testing the use case.