This chapter is written as a series of examples. We'll be describing a typical large-scale deployment, covering the following topics:
- Installation of NSO on all hosts
- Initial configuration of NSO on all hosts
- Upgrade of NSO on all hosts
- Upgrade of NSO packages/NEDs on all hosts
- Monitoring the installation
- Troubleshooting, backups and disaster recovery
- Security considerations
We'll be using a Layered Service Architecture cluster as an example deployment. The deployment consists of four hosts: a CFS node pair and an RFS node pair. The two CFS nodes are an NSO HA pair, and so are the two RFS nodes. Thus the two NSO hosts cfs-m and cfs-s make up one HA pair, one active and one standby, and similarly for the two so-called RFS nodes, rfs-m1 and rfs-s1. The HA setup as well as the cluster setup will be thoroughly described later in this chapter.
The cluster part is optional; it's only needed if the number of managed devices and/or instantiated services is expected to be so large that it doesn't fit in memory on a single NSO host. If, for example, the expected number of managed devices and services is less than 20k, the recommendation is to not use clustering at all and instead equip the NSO hosts with sufficient RAM. Installation, performance, troubleshooting, observation and maintenance all become harder with clustering turned on.
HA, on the other hand, is usually not optional for a customer deployment. Data resides in CDB, which is a RAM database with a disk-based journal for persistence. One possibility for running NSO without the HA component could be to use a fault-tolerant filesystem, such as Ceph. This would mean that provisioning data survives a disk crash on the NSO host, but failover would require manual intervention. As we shall see, the HA component we provide, tailf-hcc, also requires some manual intervention, but only after an automatic failover.
In this chapter we shall describe a complete HA/cluster setup; you will have to decide for your deployment whether HA and/or clustering is required.
We will perform an NSO system installation on 4 NSO hosts.
NSO comes with a tool called nct which is ideal for the task at hand here. nct has its own documentation and will not be described here. nct is shipped together with NSO.

The following prerequisites are needed for the nct operations:
- We need a user on the management station which has sudo rights on the four NSO hosts. Most of the nct operations that you'll execute towards the 4 NSO hosts require root privileges.
- Access to the NSO .bin install package, as well as access to the NEDs and all packages you're planning to run. The packages shall be in the form of tar.gz packages.
We'll be needing an NCT_HOSTSFILE. In this example it looks like:
$ echo $NCT_HOSTFILE
/home/klacke/nct-hosts
$ cat $NCT_HOSTFILE
{"10.147.40.80", [{name, "rfs-s1"}, {restconf_pass, "MYPASS"}, {ssh_pass, "MYPASS"},
                  {netconf_pass, "MYPASS"}, {restconf_port, 8888}]}.
{"10.147.40.78", [{name, "rfs-m"}, {restconf_pass, "MYPASS"}, {ssh_pass, "MYPASS"},
                  {netconf_pass, "MYPASS"}, {restconf_port, 8888}]}.
{"10.147.40.190", [{name, "cfs-m"}, {restconf_pass, "MYPASS"}, {ssh_pass, "MYPASS"},
                   {netconf_pass, "MYPASS"}, {restconf_port, 8888}]}.
{"10.147.40.77", [{name, "cfs-s"}, {restconf_pass, "MYPASS"}, {ssh_pass, "MYPASS"},
                  {netconf_pass, "MYPASS"}, {restconf_port, 8888}]}.
$ ls -lat /home/klacke/nct-hosts
-rw------- 1 klacke staff 1015 Jan 22 13:12 /home/klacke/nct-hosts
The different passwords in the nct-hosts file are all my regular Linux password on the target host. We can use SSH keys, especially for normal SSH shell login; unfortunately, however, the nct tool doesn't work well with ssh-agent, so the keys shouldn't have a passphrase. If they do, we'll have to enter the passphrase over and over again while using nct. Since ssh-agent doesn't work, and we'll be needing the password for the REST API access anyway, the recommended setup is to store the password for the target hosts in a read-only file. This is for ad-hoc nct usage.
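If SSH keys are preferred for the shell access, a minimal sketch of creating a passphrase-less key and distributing it to the four hosts could look as follows; the key path is just an example, and the host addresses are the ones from the nct-hosts file above.

$ ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_nct       # no passphrase, since ssh-agent is not used
$ for h in 10.147.40.80 10.147.40.78 10.147.40.190 10.147.40.77; do
      ssh-copy-id -i ~/.ssh/id_nct.pub $h
  done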
This data is needed on the management station. That can be one of the 4 NSO hosts, but it can also be another host, e.g. an operator laptop. One convenient way to get easy access to the nct command is to do a "local install" of NSO on the management station.
To test nct and the SSH setup, you can do:
$ nct ssh-cmd -c 'sudo id'
SSH command to 10.147.40.80:22 [rfs-s1]
    SSH OK : 'ssh sudo id' returned: uid=0(root) gid=0(root) groups=0(root)
SSH command to 10.147.40.78:22 [rfs-m1]
    SSH OK : 'ssh sudo id' returned: uid=0(root) gid=0(root) groups=0(root)
SSH command to 10.147.40.190:22 [cfs-m]
    SSH OK : 'ssh sudo id' returned: uid=0(root) gid=0(root) groups=0(root)
SSH command to 10.147.40.77:22 [cfs-s]
    SSH OK : 'ssh sudo id' returned: uid=0(root) gid=0(root) groups=0(root)
Now you are ready to execute the NSO installer on all the 4 NSO hosts. This is done through the nct command install.
$ nct install --file ./nso-4.1.linux.x86_64.installer.bin --progress true
............
Install NCS to 10.147.40.80:22
    Installation started, see : /tmp/nso-4.1.linux.x86_64.installer.bin.log
Install NCS to 10.147.40.78:22
    Installation started, see : /tmp/nso-4.1.linux.x86_64.installer.bin.log
Install NCS to 10.147.40.190:22
    Installation started, see : /tmp/nso-4.1.linux.x86_64.installer.bin.log
Install NCS to 10.147.40.77:22
    Installation started, see : /tmp/nso-4.1.linux.x86_64.installer.bin.log
If you for some reason want to undo everything and start over from scratch, the following command cleans up everything on all the NSO hosts.
$ nct ssh-cmd \ -c 'sudo /opt/ncs/current/bin/ncs-uninstall --non-interactive --all'
At this point NSO is properly installed on the NSO hosts. The default options were used for the NSO installer, thus files end up in the normal places on the NSO hosts. We have:
- Boot files in /etc/init.d, NSO configuration files in /etc/ncs and shell files under /etc/profile.d
- NSO run dir, with CDB database, packages directory and NSO state directory, in /var/opt/ncs
- Log files in /var/log/ncs
- The releases structure in /opt/ncs, with man pages for all NSO related commands under /opt/ncs/current/man
To read more about this, see the ncs-installer(1) man page.
After installation the configuration needs to be updated to be in sync on all 4 hosts. The configuration file /etc/ncs/ncs.conf should be identical on all hosts. Note that the configuration for encrypted strings is generated during installation. The keys are stored in the file /etc/ncs/ncs.crypto_keys and should be copied from one of the hosts to the remaining three.
The required services and authentication need to be configured taking security requirements into account. It is recommended to use PAM for authenticating users, although it is possible to have users in the NSO CDB database.
To keep configuration in sync between the hosts, copy /etc/ncs/ncs.conf and /etc/ncs/ncs.crypto_keys from one of the hosts to a management station and edit it there. See the NSO man page ncs.conf(1) for all the settings of ncs.conf.
- Enable the NSO ssh CLI login: /ncs-config/cli/ssh/enabled
- Modify the CLI prompt so that the hostname is part of the CLI prompt: /ncs-config/cli/prompt
<prompt1>\u@\H> </prompt1>
<prompt2>\u@\H% </prompt2>
<c-prompt1>\u@\H# </c-prompt1>
<c-prompt2>\u@\H(\m)# </c-prompt2>
- Enable the NSO HTTPS interface under /ncs-config/webui/, along with /ncs-config/webui/match-host-name = true and /ncs-config/webui/server-name set to the hostname of this node, following security best practice (a hedged ncs.conf sketch for this part follows after this list). The SSL certificates that get distributed with NSO are self-signed.
$ openssl x509 -in /etc/ncs/ssl/cert/host.cert -text -noout Certificate: Data: Version: 1 (0x0) Serial Number: 2 (0x2) Signature Algorithm: sha256WithRSAEncryption Issuer: C=US, ST=California, O=Internet Widgits Pty Ltd, CN=John Smith Validity Not Before: Dec 18 11:17:50 2015 GMT Not After : Dec 15 11:17:50 2025 GMT Subject: C=US, ST=California, O=Internet Widgits Pty Ltd Subject Public Key Info: .......
Thus, if this is a real production environment and the Web/REST interface is used for more than solely internal purposes, it's a good idea to replace the self-signed certificate with a properly signed one.
- Disable /ncs-config/webui/cgi unless needed.
- Enable the NSO netconf SSH interface: /ncs-config/netconf-northbound/
- Enable NSO HA in ncs.conf:

<ha>
  <enabled>true</enabled>
</ha>
- PAM - the recommended authentication setting for NSO is to rely on Linux PAM. Thus all remote access to NSO must be done using real host privileges. Depending on your Linux distro, you may have to change /ncs-config/aaa/pam/service. The default value is common-auth. Check the file /etc/pam.d/common-auth and make sure it fits your needs.
- Depending on the type of provisioning applications you have, you might want to turn /ncs-config/rollback/enabled off. Rollbacks don't work that well with reactive-fastmap applications. If your application is a classical NSO provisioning application, the recommendation is to enable rollbacks, otherwise not.
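To make the web UI settings above concrete, a hedged ncs.conf fragment for the webui part is sketched below. The element names and values should be verified against ncs.conf(1); the port matches the restconf_port used in the nct-hosts file, and the key/cert file paths are assumptions based on the certificate location shown earlier.

<webui>
  <enabled>true</enabled>
  <match-host-name>true</match-host-name>
  <server-name>cfs-m</server-name>
  <transport>
    <ssl>
      <enabled>true</enabled>
      <port>8888</port>
      <key-file>/etc/ncs/ssl/cert/host.key</key-file>
      <cert-file>/etc/ncs/ssl/cert/host.cert</cert-file>
    </ssl>
  </transport>
  <cgi>
    <enabled>false</enabled>
  </cgi>
</webui>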
Now that you have a proper ncs.conf - the same configuration file can be used on all the 4 NSO hosts - we can copy the modified file to all hosts. To do this we use the nct command:
$ nct copy --file ncs.conf
$ nct ssh-cmd -c 'sudo mv /tmp/ncs.conf /etc/ncs'
$ nct ssh-cmd -c 'sudo chmod 600 /etc/ncs/ncs.conf'
Or use the built-in support for the ncs.conf file:
$ nct load-config --file ncs.conf --type ncs-conf
The ncs.crypto_keys file must also be copied if the standard encrypted-strings configuration is used:
$ nct copy --file ncs.crypto_keys
$ nct ssh-cmd -c 'sudo mv /tmp/ncs.crypto_keys /etc/ncs'
$ nct ssh-cmd -c 'sudo chmod 400 /etc/ncs/ncs.crypto_keys'
Note that the ncs.crypto_keys file is highly sensitive. The file contains the encryption keys for all CDB data that is encrypted on disk. This usually contains passwords etc. for various entities, such as login credentials to managed devices. In YANG parlance, this is all YANG data modeled with the types tailf:des3-cbc-encrypted-string, tailf:aes-cfb-128-encrypted-string or tailf:aes-256-cfb-128-encrypted-string.
As we saw in the previous section, the REST HTTPS API is enabled. This API is used by a few of the crucial nct commands; thus if we want to use nct, we must enable password-based REST login (through PAM).
The default AAA initialization file that gets shipped with NSO resides under /var/opt/ncs/cdb/aaa_init.xml. If we're not happy with that, this is a good point in time to modify the initialization data for AAA. The NSO daemon is still not running, and we have no existing CDB files. The defaults are restrictive and fine though, so we'll keep them here.
Looking at the aaa_init.xml file we see that two groups are referred to in the NACM rule list, ncsadmin and ncsoper. The NSO authorization system is group based, thus for the rules to apply to a specific user, the user must be a member of the right group. Authentication is performed by PAM, and authorization is performed by the NSO NACM rules. Adding myself to the ncsadmin group will ensure that I get properly authorized.
$ nct ssh-cmd -c 'sudo addgroup ncsadmin'
$ nct ssh-cmd -c 'sudo adduser $USER ncsadmin'
Henceforth I will log into the different NSO hosts using my own login credentials. There are many advantages to this scheme, the main one being that all audit logs on the NSO hosts will show who did what and when. The common scheme of having a shared admin user with a shared password is not recommended.
To test the NSO logins, we must first start NSO:
$ nct ssh-cmd -c 'sudo /etc/init.d/ncs start'
Or use the nct command nct start:
$ nct start
At this point we should be able to log in over RESTCONF with curl, and also log in remotely to the NSO CLI directly. On the admin host:
$ ssh -p 2024 cfs-m
klacke connected from 10.147.40.94 using ssh on cfs-m
klacke@cfs-m> exit
Connection to cfs-m closed.
Checking the NSO audit log on the NSO host cfs-m, we see at the end of /var/log/ncs/audit.log:
<INFO> 5-Jan-2016::15:51:10.425 cfs-m ncs[666]: audit user: klacke/0 logged in over ssh from 10.147.40.94 with authmeth:publickey
<INFO> 5-Jan-2016::15:51:10.442 cfs-m ncs[666]: audit user: klacke/21 assigned to groups: ncsadmin,sambashare,lpadmin,klacke,plugdev,dip,sudo,cdrom,adm
<INFO> 5-Jan-2016::16:03:42.723 cfs-m ncs[666]: audit user: klacke/21 CLI 'exit'
<INFO> 5-Jan-2016::16:03:42.927 cfs-m ncs[666]: audit user: klacke/0 Logged out ssh <publickey> user
Especially the group assignment is worth mentioning here: we were assigned to the recently created ncsadmin group. Testing the RESTCONF API we get:
$ curl -u klacke:PASSW http://cfs-m:8080/restconf -X GET
curl: (7) Failed to connect to cfs-m port 8080: Connection refused
$ curl -k -u klacke:PASSW https://cfs-m:8888/restconf -X GET
<restconf xmlns="urn:ietf:params:xml:ns:yang:ietf-restconf">
  <data/>
  <operations/>
  <yang-library-version>2019-01-04</yang-library-version>
  <operational/>
</restconf>
The nct check command is a good command to check all 4 NSO hosts in one go:
nct check --restconf-pass PASSW --restconf-port 8888 -c all
NSO uses Cisco Smart Licensing, described in detail in Cisco Smart Licensing. After you have registered your NSO instance(s) and received a token, by following steps 1-6 as described in the Create a License Registration Token section of Cisco Smart Licensing, you need to enter a token from your Cisco Smart Software Manager account on each host. You can use the same token for all instances.
We can use the nct cli-cmd tool to do this on all NSO hosts:
$ nct cli-cmd --style cisco -c 'license smart register idtoken YzY2Yj...'
Note
The Cisco Smart Licensing CLI command is present only in the Cisco style CLI, so make sure you use the --style cisco flag with nct cli-cmd.
Depending on your installation, the size and speed of the managed devices, as well as the characteristics of your service applications, some of the default values of NSO may have to be tweaked - in particular some of the timeouts.
- Device timeouts. NSO has connect, read and write timeouts for traffic that goes from NSO to the managed devices. The default value is 20 seconds for all three. Some routers are slow to commit, some are sometimes slow to deliver their full configuration. Adjust the timeouts under /devices/global-settings accordingly.
- Service code timeouts. Some service applications can sometimes be slow. In order to minimize the chance of a service application timing out, adjusting /services/global-settings/service-callback-timeout might be applicable - depending on the application.
There are quite a few different global settings in NSO; the two mentioned above usually need to be changed. On the management station:
$ cat globs.xml
<config xmlns="http://tail-f.com/ns/config/1.0">
  <devices xmlns="http://tail-f.com/ns/ncs">
    <global-settings>
      <connect-timeout>120</connect-timeout>
      <read-timeout>120</read-timeout>
      <write-timeout>120</write-timeout>
      <trace-dir>/var/log/ncs</trace-dir>
    </global-settings>
  </devices>
  <services xmlns="http://tail-f.com/ns/ncs">
    <global-settings>
      <service-callback-timeout>180</service-callback-timeout>
    </global-settings>
  </services>
</config>
$ nct load-config --file globs.xml --type xml
For real deployments we usually want to enable SNMP. Two reasons:
- When NSO alarms are created, SNMP traps automatically get created and sent - thus we typically want to enable SNMP and also set one or more trap targets.
- Many organizations have SNMP-based monitoring systems; in order to let an SNMP-based system monitor NSO we need SNMP enabled.
There is already a decent SNMP configuration in place; it just needs a few extra localizations. We need to enable SNMP, and decide:
- If and where to send SNMP traps
- Which SNMP security model to choose
At a minimum we could have:
klacke@cfs-s% show snmp
agent {
    enabled;
    ip 0.0.0.0;
    udp-port 161;
    version {
        v1;
        v2c;
        v3;
    }
    engine-id {
        enterprise-number 32473;
        from-text testing;
    }
    max-message-size 50000;
}
system {
    contact Klacke;
    name nso;
    location Stockholm;
}
target test {
    ip 3.4.5.6;
    udp-port 162;
    tag [ x ];
    timeout 1500;
    retries 3;
    v2c {
        sec-name test;
    }
}
community test {
    sec-name test;
}
notify test {
    tag x;
    type trap;
}
We'll be using a couple of packages to illustrate the process of managing packages over a set of NSO nodes. The first prerequisite here is that all nodes must have the same version of all packages. If not, havoc will ensue. In particular HA will break, since a check is run while establishing a connection between the secondary and the primary, ensuring that both nodes have exactly the same NSO packages loaded.
On our management station we have the following NSO packages.
$ ls -lt packages total 15416 -rw-r--r-- 1 klacke klacke 8255 Jan 5 13:10 ncs-4.1-nso-util-1.0.tar.gz -rw-r--r-- 1 klacke klacke 14399526 Jan 5 13:09 ncs-4.1-cisco-ios-4.0.2.tar.gz -rw-r--r-- 1 klacke klacke 1369969 Jan 5 13:07 ncs-4.1-tailf-hcc-4.0.1.tar.gz
Package management in an NSO system install is a three-stage process.
- First, all versions of all packages reside in /opt/ncs/packages. Since this is the initial install, we'll only have a single version of our 3 example packages.
- The version of each package we want to use will reside as a symlink in /var/opt/ncs/packages/
- And finally, the packages which are actually running will reside under /var/opt/ncs/state/packages-in-use.cur
The tool here is nct packages; it can be used to upload and install our packages in stages. The nct packages command works over the RESTCONF API, thus in the following examples I have added {restconf_user, "klacke"} and also {restconf_port, 8888} to my $NCT_HOSTSFILE.
We upload all our packages as:
$ for p in packages/*; do
      nct packages --file $p -c fetch --restconf-pass PASSW
  done
Fetch Package at 10.147.40.80:8888
    OK
......
Verifying on one of the NSO hosts:
$ ls /opt/ncs/packages/
ncs-4.1-cisco-ios-4.0.2.tar.gz  ncs-4.1-tailf-hcc-4.0.1.tar.gz  ncs-4.1-nso-util-1.0.tar.gz
Verifying with the nct command:
$ nct packages --restconf-pass PASSW list Package Info at 10.147.40.80:8888 ncs-4.1-cisco-ios-4.0.2 (installable) ncs-4.1-nso-util-1.0 (installable) ncs-4.1-tailf-hcc-4.0.1 (installable) Package Info at 10.147.40.78:8888 ncs-4.1-cisco-ios-4.0.2 (installable) ncs-4.1-nso-util-1.0 (installable) ncs-4.1-tailf-hcc-4.0.1 (installable) Package Info at 10.147.40.190:8888 ncs-4.1-cisco-ios-4.0.2 (installable) ncs-4.1-nso-util-1.0 (installable) ncs-4.1-tailf-hcc-4.0.1 (installable) Package Info at 10.147.40.77:8888 ncs-4.1-cisco-ios-4.0.2 (installable) ncs-4.1-nso-util-1.0 (installable) ncs-4.1-tailf-hcc-4.0.1 (installable)
The next step is to install the packages. As stated above, package management in NSO is a three-stage process; we have now covered step one. The packages reside on the NSO hosts. Step two is to install the 3 packages. This is also done through the nct command as:
$ nct packages --package ncs-4.1-cisco-ios-4.0.2 --restconf-pass PASSW -c install
$ nct packages --package ncs-4.1-nso-util-1.0 --restconf-pass PASSW -c install
$ nct packages --package ncs-4.1-tailf-hcc-4.0.1 --restconf-pass PASSW -c install
This command will set up the symbolic links from /var/opt/ncs/packages to /opt/ncs/packages. NSO is still running with the previous set of packages. Actually, even a restart of NSO will run with the previous set of packages. The packages that get loaded at startup time reside under /var/opt/ncs/state/packages-in-use.cur.
To force a single node to restart using the set of installed packages under /var/opt/ncs/packages we can do:
/etc/init.d/ncs restart-with-package-reload
This is a full NSO restart. Depending on the amount of data in CDB and also on which data models are actually updated, it's usually faster to have the NSO node reload the data models and do the schema upgrade while running. The NSO CLI has support for this using the CLI command:
$ ncs_cli
klacke connected from 10.147.40.113 using ssh on cfs-m
klacke@cfs-m> request packages reload
Here, however, we wish to do the data model upgrade on all 4 NSO hosts; the nct tool can do this as:
$ nct packages --restconf-pass PASSW -c reload Reload Packages at 10.147.40.80:8888 cisco-ios true nso-util true tailf-hcc true Reload Packages at 10.147.40.78:8888 cisco-ios true nso-util true tailf-hcc true Reload Packages at 10.147.40.190:8888 cisco-ios true nso-util true tailf-hcc true Reload Packages at 10.147.40.77:8888 cisco-ios true nso-util true tailf-hcc true
To verify that all packages are indeed loaded and also running we can do the following in the CLI:
$ ncs_cli
klacke@cfs-m> show status packages package oper-status
package cisco-ios {
    oper-status {
        up;
    }
}
package nso-util {
    oper-status {
        up;
    }
}
package tailf-hcc {
    oper-status {
        up;
    }
}
We can use the nct tool to do it on all NSO hosts:
$ nct cli-cmd -c 'show status packages package oper-status'
This section covered initial loading of NSO packages, in a later section we will also cover upgrade of existing packages.
In this example we will be running with two HA pairs: the two CFS nodes will make up one HA pair and the two RFS nodes will make up another HA pair. We will use the tailf-hcc package as an HA framework. The package itself is well documented, thus it will not be described here. Instead we'll just show a simple standard configuration of tailf-hcc, and we'll focus on issues when managing and upgrading an HA cluster.
One simple alternative to the tailf-hcc package is to use completely manual HA, i.e. HA entirely without automatic failover. An example of code that accomplishes this can be found in the NSO example collection under examples.ncs/web-server-farm/ha/packages/manual-ha.
I have also modified the $NCT_HOSTSFILE to have a few groups so that we can do nct commands to groups of NSO hosts.
If we plan to use VIP failover, a prerequisite is the arping command and the ip command:
$ nct ssh-cmd -c 'sudo aptitude -y install arping'
$ nct ssh-cmd -c 'sudo aptitude -y install iproute2'
The tailf-hcc package gives us two things and only that:
- All CDB data becomes replicated from the primary to the secondary.
- If the primary fails, the secondary takes over and starts to act as primary, i.e. the package automatically handles one failover. At failover, tailf-hcc either brings up a virtual alias IP address using gratuitous ARP, or by means of Quagga/BGP announces a better route to an anycast IP address.
Thus we become resilient to NSO host failures. However, it's important to realize that tailf-hcc is fairly primitive once a failover has occurred. We shall run through a couple of failure scenarios in this section.
Following the tailf-hcc documentation, we have the same HA configuration on both cfs-m and cfs-s. The tool to use in order to push identical config to two nodes is nct load-config. We prepare the configuration as XML data on the management station:
$ dep cat srv-ha.xml
<ha xmlns="http://tail-f.com/pkg/tailf-hcc">
  <token>xyz</token>
  <interval>4</interval>
  <failure-limit>10</failure-limit>
  <member>
    <name>cfs-m</name>
    <address>10.147.40.190</address>
    <default-ha-role>master</default-ha-role>
  </member>
  <member>
    <name>cfs-s</name>
    <address>10.147.40.77</address>
    <default-ha-role>slave</default-ha-role>
    <failover-master>true</failover-master>
  </member>
</ha>
$ nct load-config --file srv-ha.xml --type xml --group srv
Node 10.147.40.190 [cfs-m]
    load-config result : successfully loaded srv-ha.xml with ncs_load
Node 10.147.40.77 [cfs-s]
    load-config result : successfully loaded srv-ha.xml with ncs_load
The last piece of the puzzle is now to activate HA. The configuration is now there on both the service nodes. We use the nct ha command to basically just execute the CLI command request ha commands activate on the two service nodes.
$ nct ha --group srv --action activate --restconf-pass PASSW
To verify the HA status we do:
$ nct ha --group srv --action status --restconf-pass PASSW
HA Node 10.147.40.190:8888 [cfs-m]
    cfs-m[master] connected cfs-s[slave]
HA Node 10.147.40.77:8888 [cfs-s]
    cfs-s[slave] connected cfs-m[master]
To verify the whole setup, we can now also run the nct check command, now that HA is operational:
$ nct check --group srv --restconf-pass PASSW --netconf-user klacke all ALL Check to 10.147.40.190:22 [cfs-m] SSH OK : 'ssh uname' returned: Linux SSH+SUDO OK DISK-USAGE FileSys=/dev/sda1 (/var,/opt) Use=37% RESTCONF OK NETCONF OK NCS-VSN : 4.1 HA : mode=master, node-id=cfs-m, connected-slave=cfs-s ALL Check to 10.147.40.77:22 [cfs-s] SSH OK : 'ssh uname' returned: Linux SSH+SUDO OK DISK-USAGE FileSys=/dev/sda1 (/var,/opt) Use=37% RESTCONF OK NETCONF OK NCS-VSN : 4.1 HA : mode=slave, node-id=cfs-s, master-node-id=cfs-m
As previously indicated, the tailf-hcc is not especially sophisticated. Here follows a list of error scenarios after which the operator must act. This section applies to tailf-hcc 4.x and earlier.
If the cfs-s node reboots, NSO will start from the /etc boot scripts. The HA component cannot automatically decide what to do though. It will await an explicit operator command. After reboot, we will see:
klacke@cfs-s> show status ncs-state ha
mode none;
[ok][2016-01-07 18:36:41]
klacke@cfs-s> show status ha
member cfs-m {
    current-ha-role unknown;
}
member cfs-s {
    current-ha-role unknown;
}
That is what we see on the designated secondary node; the preferred primary node cfs-m will show:
klacke@cfs-m> show status ha
member cfs-m {
    current-ha-role master;
}
member cfs-s {
    current-ha-role unknown;
}
To remedy this, the operator must once again activate HA. It suffices to do it on cfs-s, but we can in this case safely do it on both nodes, even doing it using the nct ha command.
Re-activating HA on cfs-s will ensure that:
- All data from cfs-m is copied to cfs-s.
- All future configuration changes (which have to go through cfs-m) are replicated.
This is the interesting failover scenario. Powering off the cfs-m primary node, we see the following on cfs-s:
- An alarm gets created on the designated secondary:

alarm-list {
    number-of-alarms 1;
    last-changed 2016-01-11T12:48:45.143+00:00;
    alarm ncs node-failure /ha/member[name='cfs-m'] "" {
        is-cleared false;
        last-status-change 2016-01-11T12:48:45.143+00:00;
        last-perceived-severity critical;
        last-alarm-text "HA connection lost. 'cfs-s' transitioning to HA MASTER role. When the problem has been fixed, role-override the old MASTER to SLAVE to prevent config loss, then role-revert all nodes. This will clear the alarm.";
.....

- Failover occurred:
klacke@cfs-s> show status ha
member cfs-m {
    current-ha-role unknown;
}
member cfs-s {
    current-ha-role master;
}
This is a critical moment: HA has failed over. When the original primary cfs-m restarts, the operator MUST manually decide what to do. Restarting cfs-m we get:
klacke@cfs-m> show status ha
member cfs-m {
    current-ha-role unknown;
}
member cfs-s {
    current-ha-role unknown;
}
If we now activate the original primary, it will resume its former primary role. Since cfs-s already is primary, this would be a mistake. Instead we must:
klacke@cfs-m> request ha commands role-override role slave
status override
[ok][2016-01-11 14:28:27]
klacke@cfs-m> request ha commands activate
status activated
[ok][2016-01-11 14:28:42]
klacke@cfs-m> show status ha
member cfs-m {
    current-ha-role slave;
}
member cfs-s {
    current-ha-role master;
}
This means that all config from cfs-s will be copied back to cfs-m. Once HA is once again established, we can easily go back to the original situation by executing:
klacke@cfs-m> request ha commands role-revert
on both nodes. This is recommended in order to have the running situation as normal as possible.
Note: This is indeed a critical operation. It's actually possible to lose all or some data here. For example, assume that the original primary cfs-m was down for a period of time; the following sequence of events/commands will lose data.
- cfs-m goes down at time t0.
- Node cfs-s continues to process provisioning requests until time t1, when it also goes down.
- The original primary cfs-m comes up and the operator activates cfs-m manually, at which time it can start to process provisioning requests.

The above sequence of events/commands loses all provisioning requests between t0 and t1.
The final part of configuring HA is enabling either IP layer 2 VIP (Virtual IP) support or IP layer 3 BGP anycast failover. Here we will describe the layer 2 VIP configuration; details about the anycast setup can be found in the tailf-hcc documentation.
We modify the HA configuration so that it looks as follows:
klacke@cfs-m% show ha
token xyz;
interval 4;
failure-limit 10;
vip {
    address 10.147.41.253;
}
member cfs-m {
    address 10.147.40.190;
    default-ha-role master;
    vip-interface eth0;
}
member cfs-s {
    address 10.147.40.77;
    default-ha-role slave;
    failover-master true;
    vip-interface eth0;
}
Whenever a node is primary, it will also bring up a VIP.
$ ifconfig eth0:ncsvip
eth0:ncsvip Link encap:Ethernet  HWaddr 08:00:27:0c:c3:48
            inet addr:10.147.41.253  Bcast:10.147.41.255  Mask:255.255.254.0
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
The purpose of the VIP is to have northbound systems, northbound provisioning systems, activate NSO services through the VIP which will always be reachable as long as one NSO system is still up.
Referring to previous discussion above on manual activation, if the operator reactivates HA on the designated primary after a primary reboot, we will end up with two primaries, both activating the VIP using gratuitous ARP. This must be avoided at all costs - thus HA activation must be done with care. It cannot be automated.
Clustering is a technique whereby we can use multiple NSO nodes for the data. Note: if all our data fits on one node, i.e. all service data and all device data, we recommend not using clustering. Clustering is non-trivial to configure, harder to troubleshoot, and also has some performance issues. Clustering has nothing to do with HA; it is solely a means of using multiple machines when our dataset is too big to fit in RAM on one machine.
Communication between the CFS node and the RFS nodes is over SSH and the NETCONF interface. Thus the first thing we must do is decide which user/password to use when the service node establishes the SSH connection to the RFS node. One good solution is to create a specific user solely for this purpose.
$ sudo adduser netconfncs --ingroup ncsadmin --disabled-login
$ sudo passwd netconfncs
...
$ id netconfncs
uid=1001(netconfncs) gid=1001(ncsadmin) groups=1001(ncsadmin)
At service node level we need some cluster configuration:

klacke@cfs-m% show cluster
authgroup rfsnodes {
    default-map {
        remote-name netconfncs;
        remote-password $4$T8hh78koPrja9Hggowrl2A==;
    }
}
Our complete cluster configuration looks as follows:

klacke@cfs-m> show configuration cluster
remote-node rfs1 {
    address 10.147.41.254;
    port 2022;
    ssh {
        host-key-verification none;
    }
    authgroup rfsnodes;
    username netconfncs;
}
authgroup rfsnodes {
    default-map {
        remote-name netconfncs;
        remote-password $4$T8hh78koPrja9Hggowrl2A==;
    }
}
Two important observations on this configuration:
- The address of the remote node is the VIP of the two RFS nodes.
- We turned off SSH host key verification. If we need it, we must also make sure that the SSH host keys of the two RFS nodes rfs-m1 and rfs-s1 are identical. Otherwise we'll not be able to connect over SSH after a failover.
Testing the cluster configuration involves testing a whole chain of credentials. To do this we must add at least one managed device, the RFS node, on the CFS node.
To hook up this device to our cluster we need to add this device to /devices/device on the CFS node(s). On the CFS node acting as HA primary, cfs-m1, we add:
klacke@cfs-m1> show configuration devices
authgroups {
    group default {
        default-map {
            remote-name admin;
            remote-password $4$QMl45chGWNa5h3ggowrl2A==;
        }
    }
}
device rfs1 {
    lsa-remote-node rfs1;
    authgroup default;
    device-type {
        netconf {
            ned-id lsa-netconf;
        }
    }
    state {
        admin-state unlocked;
    }
}
The device rfs1 has username admin with password admin. The lsa-remote-node leaf points to the node named rfs1 in the cluster configuration and will use its IP address, the VIP of the RFS node HA pair, and port. At this point we can read the data on device rfs1 all the way from the top level service node cfs-m. The HA secondary service node is read-only, and all data is available from there too.
When troubleshooting the cluster setup, it's a good idea to have cluster tracing turned on at the CFS nodes. On the CFS node:
klacke@cfs-m% set cluster remote-node rfs1 trace pretty
Upgrading the NSO software gives you access to new features and product improvements. Unfortunately, every change presents some risk and upgrades are not an exception. To minimize the risk and make the upgrade process as painless as possible, this section describes the recommended procedures and practices to follow during an upgrade. As usual, sufficient preparation avoids many of the pitfalls and makes the process more straightforward and less stressful.
There are multiple aspects that you should consider before starting with the actual upgrade procedure. While the development team tries to provide as much compatibility between software releases as possible, all incompatible changes cannot always be avoided. For example, when a deviation from an RFC standard is found and resolved, it may break clients that depend on the non-standard behavior. For this reason, a distinction is made between a maintenance and a major NSO upgrade.
A maintenance NSO upgrade is within the same branch, that is, when the first two version numbers stay the same (x.y in the x.y.z NSO version). An example is upgrading from version 5.6.1 to 5.6.2. In the case of a maintenance upgrade, the NSO release contains only corrections and minor enhancements, minimizing the changes. It includes binary compatibility for packages, which means there's no need to recompile the .fxs files for a maintenance upgrade.
Correspondingly, when the first or second number in the version changes, that is called a full or major upgrade. For example, upgrading version 5.6.1 to 5.7 is a major, non-maintenance upgrade. Due to new features, packages must be recompiled and some incompatibilities could manifest.
In addition to the above, a package upgrade is when you replace a package, such as a NED or a service package, with a newer version. Sometimes, when package changes aren't too big, it's possible to supply the new packages as part of the NSO upgrade, but this approach brings additional complexity. Instead, package upgrade and NSO upgrade should in general be performed as separate actions and are covered as such.
To avoid surprises during any upgrade, first ensure the following:
- Hosts have sufficient disk space, as some additional space is required for an upgrade.
- The software is compatible with the target OS. For example, sometimes a newer version of Java or system libraries, such as glibc, may be required.
- All the required NEDs and custom packages are compatible with the target NSO version.
- Existing packages have been compiled for the new version and are available to you during the upgrade.
- Check whether the existing ncs.conf file can be used as-is or needs updating. For example, stronger encryption algorithms may require you to configure additional keying material.
- Review the CHANGES file for information on what has changed.
- If upgrading from a no longer supported version of software, verify that the upgrade can be performed directly. In situations where the currently installed version is multiple years old, you may have to upgrade to one or more intermediate versions first, before you can upgrade to the target version. (A small sanity-check sketch follows after this list.)
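A few of these checks are easy to run up front on each host; the following is a minimal sketch, assuming the system-install default paths used throughout this chapter.

$ ncs --version                   # currently installed NSO version
$ df -h /opt/ncs /var/opt/ncs     # enough free disk space for the upgrade?
$ ls /var/opt/ncs/packages/       # the packages that must also exist compiled for the target version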
In case it turns out any of the packages are incompatible or cannot be simply compiled again, you will need to contact the package developers for an updated or recompiled version. For an official Cisco-supplied package, it's recommended that you always obtain a pre-compiled version if it is available for the target NSO release, instead of compiling the package yourself.
Additional preparation steps may be required based on the upgrade and the actual setup, such as when using the Layered Service Architecture (LSA) feature. In particular, for a major NSO upgrade in a multi-version LSA cluster, ensure that the new version supports the other members of the cluster and follow the additional steps outlined in Setting up LSA deployments in NSO Layered Service Architecture.
If you use the High Availability (HA) feature, the upgrade consists of multiple steps on different nodes. To avoid mistakes, you are encouraged to script the process, for which you will need to set up and verify access to all NSO instances with either ssh, nct, or some other remote management command.
Please be aware that NSO 5 introduced major changes in device model handling. See the NSO CDM Migration Guide if upgrading from a previous release.
Likewise, NSO 5.3 added support for 256-bit AES encrypted strings, requiring the AES256CFB128 key in the ncs.conf configuration. You can generate one with the openssl rand -hex 32 or similar command. Alternatively, if you use an external command for providing keys, make sure that it includes a value for an AES256CFB128_KEY in the output.
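A quick sketch of generating such a key on the command line; where exactly it goes in ncs.conf is described in ncs.conf(1), and the key=value output form on the last line is an assumption about the external key-providing command.

$ key=$(openssl rand -hex 32)      # 64 hex characters
$ echo $key
$ echo "AES256CFB128_KEY=$key"     # roughly the form an external key command would emit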
Finally, regardless of the upgrade type, make sure that you have a working backup and can easily restore the previous configuration if needed, as described in the section called “Backup and restore”.
Caution
The ncs-backup (and consequently the nct backup) command does not back up the /opt/ncs/packages folder. If you make any changes to files there, make sure you back them up separately.
The recommended approach, however, is to never modify packages in that folder. If an upgrade requires package recompilation, separate package folders (or files) should be used, one for each NSO version.
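If you do keep modified files there, a minimal sketch of backing up that folder separately (the destination path is just an example):

# tar czf /var/tmp/opt-ncs-packages-$(date +%F).tar.gz /opt/ncs/packages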
The upgrade of a single NSO instance requires the following steps:
- Create a backup.
- Perform a system install of the new version.
- Stop the old NSO server process.
- Update the /opt/ncs/current symbolic link.
- If required, update the ncs.conf configuration file.
- Update the packages in /var/opt/ncs/packages/ if recompilation is needed.
- Start the NSO server process, instructing it to reload the packages.
The following steps suppose that you are upgrading to the 5.7 release. They pertain to a system install of NSO and you must perform them with Super User privileges. As a best practice, always create a backup before trying to upgrade.
# ncs-backup
For the upgrade itself, you must first download to the host and install the new NSO release.
# sh nso-5.7.linux.x86_64.installer.bin --system-install
Then, you stop the currently running server with the help of the init.d script or an equivalent command, relevant to your system.
# /etc/init.d/ncs stop
Stopping ncs: .
Next, you update the symbolic link for the currently selected version to point to the newly installed one, 5.7 in this case.
#cd /opt/ncs
#rm -f current
#ln -s ncs-5.7 current
While seldom necessary, at this point you would also update the /etc/ncs/ncs.conf file.
Now, make sure that the /var/opt/ncs/packages/ directory has packages that are appropriate for the new version. For a maintenance upgrade, it should be possible to continue using the same packages. But for a major upgrade, you must normally rebuild the packages or use packages pre-built for the new version. It is very important that you ensure this directory contains the exact same version of each existing package, compiled for the new release, and nothing else.
As a best practice, the available packages are kept in /opt/ncs/packages/ and /var/opt/ncs/packages/ only contains symbolic links. In this case, to identify the release they were compiled for, the package file names all start with the corresponding NSO version. Then, you only need to rearrange the symbolic links in the /var/opt/ncs/packages/ directory.
#cd /var/opt/ncs/packages/
#rm -f *
#for pkg in /opt/ncs/packages/ncs-5.7-*; do ln -s $pkg; done
Please note that the above package naming scheme is neither required nor enforced. If your package filesystem names differ from it, you will need to adjust the preceding command accordingly.
Finally, you start the new version of the NSO server with the package reload flag set.
# /etc/init.d/ncs start-with-package-reload
Starting ncs: .
NSO will perform the necessary data upgrade automatically.
If you have changed or removed any packages, this process may fail. In that case, ensure that the correct versions of all packages are present in /var/opt/ncs/packages/ and retry the preceding command.
Also note that with many packages or data entries in the CDB this process could take more than 90 seconds and result in the following error message being reported:
Starting ncs (via systemctl): Job for ncs.service failed because a timeout was exceeded. See "systemctl status ncs.service" and "journalctl -xe" for details. [FAILED]
That does not imply that NSO failed to start, just that it took longer than 90 seconds. It is recommended you wait some additional time before verifying.
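If this is a recurring nuisance on a systemd-based host, one possible workaround (an assumption, adjust to your environment; the ncs.service unit name comes from the error message above) is to raise the start timeout with a drop-in override:

# mkdir -p /etc/systemd/system/ncs.service.d
# cat > /etc/systemd/system/ncs.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=600
EOF
# systemctl daemon-reload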
It is imperative you have a working copy of data available from which you can restore. That is why you must always create a backup before starting an upgrade. Only a backup guarantees that you can rerun the upgrade or back out of it, should it be necessary.
The same steps can also be used to restore data on a new, similar host if the OS of the initial host becomes corrupted beyond repair.
First, stop the NSO process if it is running.
# /etc/init.d/ncs stop
Stopping ncs: .
Verify and, if necessary, revert the symbolic link in /opt/ncs/ to point to the initial NSO release.
#cd /opt/ncs
#ls -l current
#ln -s ncs-VERSION current
In the exceptional case where the initial version installation was removed or damaged, you will need to re-install it first and redo the step above.
Verify if the correct (initial) version of NSO is being used.
# ncs --version
Next, restore the backup.
# ncs-backup --restore
Finally, start the NSO server and verify the restore was successful.
# /etc/init.d/ncs start
Starting ncs: .
Upgrading NSO in a highly available (HA) setup is a staged process. It entails running various commands across multiple NSO instances at different times.
The procedure is almost the same for a maintenance and major NSO upgrade. The difference is that a major upgrade requires the replacement of packages with recompiled ones. Still, a maintenance upgrade is often perceived as easier because there are fewer changes in the product.
In addition, this same process can also be used for only upgrading the packages.
The stages of the upgrade are:
- First enable read-only mode on the designated primary, and then on the secondary that is enabled for fail-over.
- Take a full backup on all nodes.
- Disconnect the HA pair by disabling HA on the designated primary, then temporarily promote the designated secondary to the actual primary (leader). This ensures the shared virtual IP address (VIP) fails over and comes back up on the designated secondary as soon as possible, avoiding the automatic reconnect attempts.
- Upgrade the designated primary.
- Disable HA on the designated secondary node, to allow the designated primary to become the actual primary in the next step.
- Activate HA on the designated primary, which will assume the assigned (primary) role, again providing the service through the shared VIP. However, at this point, the system is still without HA.
- Upgrade the designated secondary node.
- Activate HA on the designated secondary, which will assume its assigned (secondary) role, connecting HA again.
- Verify that HA is operational and has converged.
The main thing to note is that all packages must match the NSO release. If they do not, the upgrade will fail.
In the case of a major upgrade, you must recompile the packages for the new version. It is highly recommended that you use pre-compiled packages and do not compile them during this upgrade procedure, since the compilation can prove nontrivial and the production hosts may lack all the required (development) tooling. You should use a naming scheme to distinguish between packages compiled for different NSO versions. A good option is for package file names to start with the ncs-MAJORVERSION- prefix for a given major NSO version. This ensures multiple packages can co-exist in the /opt/ncs/packages folder, and the NSO version they can be used with becomes obvious.
The following is a transcript of a sample upgrade procedure, showing the commands for each step described above, in a 2-node HA setup, with nodes in their initial designated state.
<switch to designated primary CLI>
admin@ncs#show high-availability status mode
high-availability status mode primary
admin@ncs#high-availability read-only mode true
<switch to designated secondary CLI>
admin@ncs#show high-availability status mode
high-availability status mode secondary
admin@ncs#high-availability read-only mode true
<switch to designated primary shell>
#ncs-backup
<switch to designated secondary shell>
#ncs-backup
<switch to designated primary CLI>
admin@ncs#high-availability disable
<switch to designated secondary CLI>
admin@ncs#high-availability be-master
<switch to designated primary shell>
#<upgrade node>
#/etc/init.d/ncs restart-with-package-reload
<switch to designated secondary CLI>
admin@ncs#high-availability disable
<switch to designated primary CLI>
admin@ncs#high-availability enable
<switch to designated secondary shell>
#<upgrade node>
#/etc/init.d/ncs restart-with-package-reload
<switch to designated secondary CLI>
admin@ncs#high-availability enable
Scripting is a recommended way to upgrade the NSO version of an HA cluster. The following example script shows the required commands and can serve as a basis for your own customized upgrade script. In particular, the script requires the specific package naming convention above, and you may need to tailor it to your environment. In addition, it expects the new release version and the designated primary and secondary node addresses as the arguments. The recompiled packages are read from the packages-MAJORVERSION/ directory.
For the example script below we configured our primary and secondary nodes with the nominal roles that they assume at startup and when HA is enabled. Automatic failover is also enabled, so that the secondary will assume the primary role if the primary node goes down.
<config xmlns="http://tail-f.com/ns/config/1.0">
  <high-availability xmlns="http://tail-f.com/ns/ncs">
    <ha-node>
      <id>n1</id>
      <nominal-role>master</nominal-role>
    </ha-node>
    <ha-node>
      <id>n2</id>
      <nominal-role>slave</nominal-role>
      <failover-master>true</failover-master>
    </ha-node>
    <settings>
      <enable-failover>true</enable-failover>
      <start-up>
        <assume-nominal-role>true</assume-nominal-role>
        <join-ha>true</join-ha>
      </start-up>
    </settings>
  </high-availability>
</config>
#!/bin/bash
set -ex

vsn=$1
primary=$2
secondary=$3

installer_file=nso-${vsn}.linux.x86_64.installer.bin
pkg_vsn=$(echo $vsn | sed -e 's/^\([0-9]\+\.[0-9]\+\).*/\1/')
pkg_dir="packages-${pkg_vsn}"

function on_primary() { ssh $primary "$@" ; }
function on_secondary() { ssh $secondary "$@" ; }
function on_primary_cli() { ssh -p 2024 $primary "$@" ; }
function on_secondary_cli() { ssh -p 2024 $secondary "$@" ; }

function upgrade_nso() {
    target=$1
    scp $installer_file $target:
    ssh $target "sh $installer_file --system-install --non-interactive"
    ssh $target "rm -f /opt/ncs/current && \
                 ln -s /opt/ncs/ncs-${vsn} /opt/ncs/current"
}

function upgrade_packages() {
    target=$1
    do_pkgs=$(ls "${pkg_dir}/" || echo "")
    if [ -n "${do_pkgs}" ] ; then
        cd ${pkg_dir}
        ssh $target 'rm -rf /var/opt/ncs/packages/*'
        for p in ncs-${pkg_vsn}-*.gz; do
            scp $p $target:/opt/ncs/packages/
            ssh $target "ln -s /opt/ncs/packages/$p /var/opt/ncs/packages/"
        done
        cd -
    fi
}

# Perform the actual procedure
on_primary_cli 'request high-availability read-only mode true'
on_secondary_cli 'request high-availability read-only mode true'
on_primary 'ncs-backup'
on_secondary 'ncs-backup'
on_primary_cli 'request high-availability disable'
on_secondary_cli 'request high-availability be-master'
upgrade_nso $primary
upgrade_packages $primary
on_primary '/etc/init.d/ncs restart-with-package-reload'
on_secondary_cli 'request high-availability disable'
on_primary_cli 'request high-availability enable'
upgrade_nso $secondary
upgrade_packages $secondary
on_secondary '/etc/init.d/ncs restart-with-package-reload'
on_secondary_cli 'request high-availability enable'
Once the script completes, it is paramount that you manually verify the outcome. First, check that the HA is enabled by using the show high-availability command on the CLI of each node. Then connect to the designated secondaries and ensure they have the complete latest copy of the data, synchronized from the primaries.
The described upgrade procedure is for an HA pair. The nodes are expected to have initially assigned (nominal) roles, and the procedure ensures that is the case at the end. For a 3-node consensus setup, first disable the HA on the third (non-fail-over) node, perform the described procedure, and finally upgrade the 3rd node as well.
After the primary node is upgraded and restarted, the read-only mode is automatically disabled. This allows the primary node to start processing writes, minimizing downtime. However, there is no HA. Should the primary fail at this point or you need to revert to a pre-upgrade backup, the new writes would be lost. To avoid this scenario, again enable read-only mode on the primary after re-enabling HA. Then disable read-only mode only after successfully upgrading and reconnecting the secondary.
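In terms of the script above, this amounts to roughly the following additional lines, using the same helper functions. This is only a sketch; the read-only mode false form assumes the mode leaf is a boolean, so double-check it against your NSO version.

# directly after: on_primary_cli 'request high-availability enable'
on_primary_cli 'request high-availability read-only mode true'

# ... and as the very last step, once the secondary is upgraded and has reconnected:
on_primary_cli 'request high-availability read-only mode false'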
To further reduce time spent upgrading, you can customize the script to install the new NSO release and copy packages beforehand. Then, you only need to switch the symbolic links and restart the NSO process to use the new version.
You can use the same script for a maintenance upgrade as-is, with an empty packages-MAJORVERSION directory, or remove the upgrade_packages calls from the script.
Example implementations that use scripts to upgrade a 2- and 3-node setup using CLI/MAAPI or RESTCONF are available in the NSO example set under examples.ncs/development-guide/high-availability.
If you do not wish to automate the upgrade process, you will need to follow the instructions from the section called “Single Instance Upgrade” and transfer the required files to each host manually. Additional information on HA is available in High Availability. However, you can run the high-availability actions from the preceding script on the NSO CLI as-is. In this case, please take special care on which host you perform each command, as it can be easy to mix them up.
Package upgrades are frequent and routine in development but require the same care as NSO upgrades in the production environment. The reason is that the new packages may contain an updated YANG model, resulting in a data upgrade process similar to version upgrade. So, if a package is removed or uninstalled and a replacement is not provided, package-specific data, such as service instance data, will be removed as well.
In a single-node environment, the procedure is straightforward. Create a backup with the ncs-backup command and ensure the new package is compiled for the current NSO version and available under the /opt/ncs/packages directory. Then either manually rearrange the symbolic links in the /var/opt/ncs/packages directory or use the software packages install command in the NSO CLI. Finally, invoke the packages reload command. For example:
#ncs-backup
INFO  Backup /var/opt/ncs/backups/ncs-5.7@2022-01-21T10:34:42.backup.gz created successfully
#ls /opt/ncs/packages
ncs-5.7-router-nc-1.0  ncs-5.7-router-nc-1.0.2
#ncs_cli -C
admin@ncs#software packages install package router-nc-1.0.2 replace-existing
installed ncs-5.7-router-nc-1.0.2
admin@ncs#packages reload
>>> System upgrade is starting.
>>> Sessions in configure mode must exit to operational mode.
>>> No configuration changes can be performed until upgrade has completed.
>>> System upgrade has completed successfully.
reload-result {
    package router-nc-1.0.2
    result true
}
On the other hand, upgrading packages in an HA setup is a staged process. Broadly, it follows the same sequence of steps as upgrading the NSO and should be scripted for the same reasons. The difference is that you must explicitly uninstall the old packages and install the new ones.
Next you will find a description of an upgrade procedure for an HA pair. It is expected that all nodes are in their assigned (nominal) roles initially and the procedure ensures that's the case at the end. For a 3-node consensus setup, you must first disconnect the third (non-fail-over) node, perform the described procedure, and finally upgrade the 3rd node as well.
After backing up all the nodes and disabling writes, switch over the HA to the secondary node, allowing you to perform the necessary work on the primary. Having disabled the HA on the primary node, execute the following instructions on the primary.
Transfer the new packages into /opt/ncs/packages with the help of the scp command or in some other way. Select the correct packages by manually rearranging the symlinks in the /var/opt/ncs/packages folder or by using the software packages install/deinstall commands in the CLI. Lastly, execute the packages reload command.
After verifying the node was successfully upgraded, switch the HA back to the upgraded primary, by disabling the HA on the designated secondary and enabling it on the designated primary. Then, repeat the transfer and upgrade of the packages on the designated secondary from the previous paragraph. Finally, reactivate the HA on the designated secondary and disable read-only mode.
The following example script codifies this procedure for an upgrade of a single package. Please customize it to your specific needs and environment.
#!/bin/bash
set -ex

primary=$1
secondary=$2
oldpkg=$3
newpkg=$4

function on_primary() { ssh $primary "$@" ; }
function on_secondary() { ssh $secondary "$@" ; }
function on_primary_cli() { ssh -p 2024 $primary "$@" ; }
function on_secondary_cli() { ssh -p 2024 $secondary "$@" ; }

on_primary_cli 'request high-availability read-only mode true'
on_secondary_cli 'request high-availability read-only mode true'
on_primary 'ncs-backup'
on_secondary 'ncs-backup'
on_primary_cli 'request high-availability disable'
on_secondary_cli 'request high-availability be-master'
scp ${newpkg}.tar.gz $primary:/opt/ncs/packages/
on_primary_cli "request software packages deinstall package ${oldpkg}"
on_primary_cli "request software packages install package ${newpkg}"
on_primary_cli 'request packages reload'
on_secondary_cli 'request high-availability disable'
on_primary_cli 'request high-availability enable'
scp ${newpkg}.tar.gz $secondary:/opt/ncs/packages/
on_secondary_cli "request software packages deinstall package ${oldpkg}"
on_secondary_cli "request software packages install package ${newpkg}"
on_secondary_cli 'request packages reload'
on_secondary_cli 'request high-availability enable'
You can extend the script to handle multiple packages in one go, making it more efficient. In that case, you should also consider using the request packages ha sync CLI command to further optimize the process. This command distributes all available packages from the current primary node to secondary nodes but does not install them. The command does not perform the sync on the node with none role.
The script uses the packages reload command to load the new data models into NSO instead of restarting the server process. This is considerably more efficient, and the time saved during an upgrade can be substantial when the amount of data in CDB is large.
In some cases, NSO may give warnings when the upgrade looks "suspicious." For more information on this, please see the section called “Loading Packages”. If you understand the implications and are willing to risk losing data, use the force option with packages reload or set the NCS_RELOAD_PACKAGES environment variable to force when restarting NSO. It will force NSO to ignore warnings and proceed with the upgrade. In general, this is not recommended.
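If you do decide to force the upgrade, the two variants look roughly as follows; the init script path assumes the system install described earlier, so adjust it to however NSO is started on your hosts:
klacke@cfs-m> request packages reload force

$ sudo NCS_RELOAD_PACKAGES=force /etc/init.d/ncs restart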
In addition, you must take special care with NED upgrades because services depend on them. NSO 5 introduced the CDM feature, which allows loading multiple versions of a NED; as a consequence, a major NED upgrade requires a procedure involving the migrate action.
A NED release that contains nontrivial YANG model changes is called a major NED upgrade: the ned-id changes, as does the first or second number in the NED version, since NEDs follow the same versioning scheme as NSO. In this case, you cannot simply replace the package, as you would for a maintenance or patch NED release. Instead, you must load (add) the new NED package alongside the old one and perform the migration.
Migration uses the /ncs:devices/device/migrate action to change the ned-id of a single device or a group of devices. It does not affect the actual network device, except possibly reading from it. Therefore, the migration does not have to be performed as part of the package upgrade procedure described above; it can be done later, during normal operations. The details are described in the section called “NED Migration”.
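A sketch of migrating a single device, using an illustrative device name and ned-id; run it with dry-run first to inspect the impact, and verify the available action parameters against your NSO version:
klacke@rfs-m1> request devices device pe-router-1 migrate new-ned-id router-nc-1.2 dry-run
klacke@rfs-m1> request devices device pe-router-1 migrate new-ned-id router-nc-1.2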
Once the migration is complete, you can remove the old NED by performing another package upgrade in which you deinstall the old NED package. This can be done straight after the migration or as part of the next upgrade cycle.
NSO has the ability to install emergency patches during runtime. These are delivered in the form of .beam files. You must copy the files into the /opt/ncs/current/lib/ncs/patches/ folder and load them with the ncs-state patches load-modules command.
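A sketch of applying such a patch; the .beam file name is illustrative, as the actual file would be supplied by support:
$ sudo cp ncs_bugfix.beam /opt/ncs/current/lib/ncs/patches/
$ ncs_cli -u klacke
klacke@cfs-m> request ncs-state patches load-modules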
We have already covered some of the logging settings that can be set in ncs.conf. All ncs.conf settings are described in the man page for ncs.conf:
$ man ncs.conf
.....
The NSO system install that you have performed on your 4 hosts also installs good defaults for logrotate. Inspect /etc/logrotate.d/ncs and ensure that the settings are what you want.
Note: The NSO error logs, i.e. the files /var/log/ncs/ncserr.log*, are rotated internally by NSO and MUST NOT be rotated by logrotate.
A crucial tool for debugging NSO installations is the NED trace logs. These logs are very verbose and are for debugging only; do not enable them in production. Note that everything, including potentially sensitive data, is logged. No filtering is done. The NED trace logs are controlled in the CLI under /devices/global-settings/trace. It's also possible to control the NED trace on a per-device basis under /devices/device[name='x']/trace.
There are 3 different levels of trace, and for various historic reasons you usually want different settings depending on the device type.
-
For all CLI NEDs, you want to use the raw setting.
-
For all ConfD-based NETCONF devices, you want to use the pretty setting. ConfD sends the NETCONF XML unformatted; pretty means that the XML is formatted in the trace.
-
For Juniper devices, you want to use the raw setting. Juniper sometimes sends broken XML that cannot be properly formatted; however, their XML payload is already indented and formatted.
-
For generic NED devices, you want either pretty or raw, depending on the level of trace support in the NED itself.
-
For SNMP based devices, you want the pretty setting.
Thus, it's usually not good enough to just control the NED trace from /devices/global-settings/trace.
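For example, to enable raw trace for a single CLI NED device and later turn it off again, something like the following can be used; the device name is illustrative:
klacke@rfs-m1% set devices device pe-router-1 trace raw
klacke@rfs-m1% commit
klacke@rfs-m1% set devices device pe-router-1 trace false
klacke@rfs-m1% commit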
User application Java logs are written to /var/log/ncs/ncs-java-vm.log.
The level of logging from Java code is controlled on a per-Java-package basis. For example, if you want to increase the level of logging for the tailf-hcc code, you need to look into the code and find the name of the corresponding Java package. Unpacking the tailf-hcc tar.gz package, you see in the tailf-hcc/src/java/src/com/tailf/ns/tailfHcc/TcmApp.java file that the package is called com.tailf.ns.tailfHcc. You can then do:
klacke@cfs-s% show java-vm java-logging
logger com.tailf.ns.tailfHcc {
    level level-all;
}
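The logger entry shown above can be created along these lines in the J-style CLI (a sketch):
klacke@cfs-s% set java-vm java-logging logger com.tailf.ns.tailfHcc level level-all
klacke@cfs-s% commit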
The internal NSO log resides at /var/log/ncs/ncserr.*. The log is written in a binary format; to view the internal error log, run the following command:
$ ncs --printlog /var/log/ncs/ncserr.log
The nct get-logs command grabs all logs from all hosts, which is useful when collecting data about the system.
All large-scale deployments employ monitoring systems. There are plenty of good tools to choose from, both open source and commercial; examples are Cacti and Nagios. All good monitoring tools have the ability to script (using various protocols) what should be monitored, and the NSO REST API is ideal for this. It is also recommended to set up a special read-only Linux user without shell access for this purpose. The nct check command summarizes well what should be monitored.
The REST API can be used to view the NSO alarm table. NSO alarms are not events; whenever an NSO alarm is created, an SNMP trap is also sent (assuming that you have configured a proper SNMP target). All alarms require operator intervention. Thus, a monitoring tool should also GET the NSO alarm table:
curl -k -u klacke:PASSW https://cfs-m:8888/api/operational/alarms/alarm-list -X GET
Whenever there are new alarms, an operator MUST take a look.
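The following is a minimal sketch of how a monitoring tool could poll the alarm list; the read-only user monitor is illustrative, and the match on <alarm> assumes the default XML output of the REST API:
#!/bin/bash
# Exit non-zero if the NSO alarm list is non-empty, so that the script can be
# wired into e.g. a Nagios or Cacti style check.
HOST=cfs-m
OUT=$(curl -sk -u monitor:PASSW "https://${HOST}:8888/api/operational/alarms/alarm-list" -X GET)
COUNT=$(printf '%s' "$OUT" | grep -c '<alarm>')
if [ "$COUNT" -gt 0 ]; then
    echo "WARNING: ${COUNT} NSO alarm(s) on ${HOST}"
    exit 1
fi
echo "OK: no NSO alarms on ${HOST}"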
First, the presented configuration enables the built-in web server for the web UI and RESTCONF. It is paramount for security that you only enable HTTPS access, with /ncs-config/webui/match-host-name and /ncs-config/webui/server-name properly set.
Second, the AAA setup described so far in this deployment document is the recommended AAA setup. To reiterate:
-
Have all users that need access to NSO in PAM; this may be through /etc/passwd or whatever backend your PAM configuration uses. Do not store any users in CDB.
-
Given the default NACM authorization rules, you should have three different types of users on the system:
-
Users with shell access that are members of the ncsadmin Linux group. These users are considered fully trusted. They have full access to the system as well as the entire network.
-
Users without shell access that are members of the ncsadmin Linux group. These users have full access to the network. They can SSH to the NSO SSH shell, they can execute arbitrary REST calls, etc. They cannot manipulate backups or perform system upgrades. If you have provisioning systems north of NSO, it is recommended to assign a user of this type for those operations.
-
Users without shell access that are members of the ncsoper Linux group. These users have read-only access to the network. They can SSH to the NSO SSH shell, they can execute arbitrary REST calls, etc. They cannot manipulate backups or perform system upgrades. A sketch of creating such a user follows after this list.
-
If you have more fine-grained authorization requirements than read-write all and read all, additional Linux groups can be created and the NACM rules can be updated accordingly.
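A sketch of creating a read-only user of the last kind, i.e. without shell access; the user name is illustrative and the nologin path varies between distributions:
$ sudo useradd --shell /usr/sbin/nologin --groups ncsoper oper1
$ sudo passwd oper1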
Since the NACM rules are data-model specific, an example follows.
Assume that you have a service that stores all its data under /mv:myvpn. These services, once instantiated, manipulate the network. Apart from the ncsoper and ncsadmin users that you already have, you want two new sets of users: one set that can read everything under /mv:myvpn, and one set that can read-write everything there. They are not allowed to see anything else in the system as a whole.
To accomplish this, it is recommended to do the following:
-
Create two new Linux groups, one called vpnread and one called vpnwrite (a sketch of this step follows after the NACM example below).
-
Modify /nacm by adding the following to all 4 nodes:
$ cat nacm.xml
<nacm xmlns="urn:ietf:params:xml:ns:yang:ietf-netconf-acm">
  <groups>
    <group>
      <name>vpnread</name>
    </group>
    <group>
      <name>vpnwrite</name>
    </group>
  </groups>
  <rule-list>
    <name>vpnwrite</name>
    <group>vpnwrite</group>
    <rule>
      <name>rw</name>
      <module-name>myvpn</module-name>
      <path>/myvpn</path>
      <access-operations>create read update delete</access-operations>
      <action>permit</action>
    </rule>
    <cmdrule xmlns="http://tail-f.com/yang/acm">
      <name>any-command</name>
      <action>permit</action>
    </cmdrule>
  </rule-list>
  <rule-list>
    <name>vpnread</name>
    <group>vpnread</group>
    <rule>
      <name>ro</name>
      <module-name>myvpn</module-name>
      <path>/myvpn</path>
      <access-operations>read</access-operations>
      <action>permit</action>
    </rule>
    <cmdrule xmlns="http://tail-f.com/yang/acm">
      <name>any-command</name>
      <action>permit</action>
    </cmdrule>
  </rule-list>
</nacm>
$ nct load-config --file nacm.xml --type xml
The above command will merge the data in nacm.xml on top of the already existing NACM data in CDB.
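For completeness, a sketch of the Linux-side group creation from the first step and a quick check that the rules were loaded; adjust the commands to your distribution:
$ sudo groupadd vpnread
$ sudo groupadd vpnwrite
$ ncs_cli -u klacke
klacke@cfs-m> configure
klacke@cfs-m% show nacm rule-list vpnwrite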
For a detailed discussion of the configuration of authorization rules through NACM, see The AAA infrastructure, in particular the section called “Authorization”.
A considerably more complex scenario is when you need or want to have users with shell access to the host, but those users are either untrusted or shouldn't have any access to NSO at all.
NSO listens on the port configured by /ncs-config/ncs-ipc-address, typically on localhost; by default this is 127.0.0.1:4569. The purpose of the port is to multiplex several different access methods to NSO. The main security-related point to make here is that no AAA checks at all are done on that port. If you have access to the port, you also have complete access to all of NSO.
To drive this point home: when you invoke the ncs_cli command, that is a small C program that connects to the port and tells NSO who you are, assuming that authentication has already been performed. There is even a documented flag, --noaaa, which tells NSO to skip all NACM rule checks for this session.
To cover the scenario of untrusted users with shell access, you must protect the port. This is done through the use of a file in the Linux file system. At install time, the file /etc/ncs/ipc_access is created and populated with random data. Enable /ncs-config/ncs-ipc-access-check/enabled in ncs.conf and ensure that trusted users can read the /etc/ncs/ipc_access file, for example by changing the group access to the file:
$ cat /etc/ncs/ipc_access
cat: /etc/ncs/ipc_access: Permission denied
$ sudo chown root:ncsadmin /etc/ncs/ipc_access
$ sudo chmod g+r /etc/ncs/ipc_access
$ ls -lat /etc/ncs/ipc_access
$ cat /etc/ncs/ipc_access
.......