OUTDATED
OUTDATED

This documentation corresponds to an older version of the product, is no longer updated, and may contain outdated information.
Please access the latest versions from https://cisco-tailf.gitbook.io/nso-docs and update your bookmarks. OK

Alarm Manager

Alarm Manager Introduction

NSO embeds a generic alarm manager. It is used for managing NSO native alarms and can easily be extended with application specific alarms. Alarm sources can be notifications from devices, undesired states on services detected or anything provided via the Java API.

Figure 4. The Alarm Manager

The Alarm Manager has three main components:

Alarm List: a list of alarms in NSO. Each list entry represents an alarm state for a specific device, object within the device and an alarm type
Alarm Model: for each alarm type, you can configure the mapping to for example X.733 alarm standard parameters that are sent as notifications northbound
Operator Actions: actions to set operator states on alarms such as acknowledgement, and also actions to administratively manage the alarm list such as deleting alarms

The alarm manager is accessible over all northbound interfaces. A read-only view including an SNMP alarm table and alarm notifications is available in an SNMP Alarm MIB. This MIB is suitable for integration to SNMP based alarm systems.

In order to populate the alarm list there is a dedicated Java API. This API lets a developer add alarms, change states on alarms etc. A common usage pattern is to use the SNMP notification receiver to map a subset of the device traps into alarms.

Overview of the alarm concepts

First of all it is important to clearly define what an alarm means. “An alarm denotes an undesirable state in a resource for which an operator action is required. ” Alarms are often confused with general logging and event mechanisms, thereby overflooding the operator with alarms. In NSO, the alarm manager shows undesired resource states that an operator should investigate. NSO contains other mechanisms for logging in general. Therefore, NSO does not naively populate the alarm list with traps received in the SNMP notification receiver.

Before looking into how NSO handles alarms it is important to define the fundamental concepts. We make a clear distinction between alarms and events in general. Alarms should be taken seriously and be investigated. Alarms have states, they go active with a specific severity, they change severity and they are cleared by the resource. The same alarm may become active again. A common mistake is to confuse the operator view with the resource view. The model described so far is the resource view. The resource itself may consider the alarm cleared. The alarm manager does not automatically delete cleared alarms. An alarm that has existed in the network may still need investigation. There are dedicated actions an operator can use to manage the alarm list, for example delete alarms based on criterias such as cleared and date. These actions can be performed over all north-bound interfaces.

Rather than viewing alarms as a list of alarm notifications NSO defines alarms as states on objects. The NSO alarm list uses four keys for alarms: the alarming object within a device and the alarm type and an optional specific problem. Alarm types are normally unique identifiers for a specific alarm state and are defined statically. An alarm type corresponds to the well-known X.733 alarm standard tuple event type, and probable cause. Specific problem is an optional key that is string based and can further redefine an alarm type at run-time. This is needed for alarms that are not known before a system is deployed. Imagine a system with general digital inputs. A MIB might specify traps called input-high, input-low. When defining the SNMP notification reception, an integrator might define an alarm type called "External-Alarm". Input-high might imply a major alarm and input-low might imply clear. At installation some detectors report "fire-alarm" and some "door-open" alarms. This is configured at the device and sent as free text in the SNMP var-binds. This is then managed by using the specific problem field of the NSO alarm manager to separate this different alarm types.

The data model for the alarm-manager is outlined in Figure 5, “Alarm Model”

Figure 5. Alarm Model

This means that we have a list with key: (managed device, managed object, alarm type, specific problem). In the example above we might have the following different alarms:

Device : House1; Managed Object : Detector1; Alarm-Type : External Alarm; Specific Problem = Smoke;
Device : House1; Managed Object : Detector2; Alarm-Type : External Alarm; Specific Problem = Door Open;

Each alarm entry shows the last status change for the alarm and also a child list with all status changes sorted in chronological order.

is-cleared: was the last state change clear?
last-status-change: time stamp for last status change
last-perceived-severity: last severity (not equal to clear)
last-alarm-text: the last alarm text (not equal to clear)
status-change, event-time: the time reported by the device
status-change, received-time: the time the state change was received by NSO
status-change, perceived-severity: the new perceived severity
status-change, alarm-text: descriptive text associated with the new alarm status

It is fundamental to define alarm types (specific problem) and managed objects with a fine-grained mechanism that still is extensible. For objects we allow YANG instance-identifiers to refer to a YANG instance identifier, an SNMP OID, or a string. Strings can be used when the underlying object is not modelled. We use YANG identities to define alarm types. This has the benefit that alarm types can be defined in a named hierarchy and thereby provide an extensible mechanism. In order to support "dynamic alarm types" so that alarms can be separated by information only available at run-time the string based field specific problem can also be used.

So far we have described the model based on the resource view. It is common practice to let operators manipulate the alarms corresponding to the operators investigation. We clearly separate the resource and the operator view, for example, there is no such thing as an operator "clearing an alarm". Rather the alarm entries can have a corresponding alarm handling state. Operators may want to acknowledge an alarm, set the alarm state to closed or similar.

We also support some alarm list administrative actions:

Synchronize alarms: try to read the alarm states in the underlying resources and update the alarm list accordingly (this action needs to be implemented by user code for specific applications)
Purge alarms: delete entries in the alarm list based on several different filter criteria
Filter alarms: with a XPATH as filter input, this action returns all alarms fulfilling the filter
Compress alarms: since every entry may contain a large amount of state change entries this action compresses the history to the latest state change

Alarms can be forwarded over NSO northbound interfaces. In many telecom environments alarms need to be mapped to X.733 parameters. We provide an alarm model where every alarm type is mapped to the corresponding X.733 parameters such as event type and probable cause. In this way, it is easy to integrate the NSO alarms into whatever X.733 enumerated values the upper fault management system is requiring.

The Alarm Model

The central part of the YANG Alarm model, tailf-ncs-alarms.yang has the following structure.

Example 39. tailf-ncs-alarms.yang

module tailf-ncs-alarms {

  namespace "http://tail-f.com/ns/ncs-alarms";
  prefix "al";
  ...
 typedef managed-object-t {
    type union {
      type instance-identifier {
        require-instance false;
        }
      type yang:object-identifier;
      type string;
    }


  ...
  typedef event-type  {
    type enumeration {
      enum other {value 1;}
      enum communicationsAlarm {value 2;}
      enum qualityOfServiceAlarm {value 3;}
      enum processingErrorAlarm {value 4;}
      enum equipmentAlarm {value 5;}
      ...
    }
    description
    "...";
    reference
    "ITU Recommendation X.736, 'Information Technology - Open
     Systems Interconnection - System Management: Security
     Alarm Reporting Function', 1992";
  }

  typedef severity-t  {
    type enumeration {
      enum cleared {value 1;}
      enum indeterminate {value 2;}
      enum critical {value 3;}
      enum major {value 4;}
      enum minor {value 5;}
      enum warning {value 6;}
    }
    description
      "...";
  }
  ...
  identity alarm-type {
    description
    "Base identity for alarm types."
    ...
  }

  identity ncs-dev-manager-alarm {
    base alarm-type;
  }

  identity ncs-service-manager-alarm {
    base alarm-type;
  }

  identity connection-failure {
    base ncs-dev-manager-alarm;
    description
      "NCS failed to connect to a device";
  }
  ....
  container alarm-model {
    list alarm-type {
      key "type";
        leaf type {
          type alarm-type-t;
        }

        uses alarm-model-parameters;
     }
  }

      ...


    container alarm-list {
      config false;
      leaf number-of-alarms {
        type yang:gauge32;
      }

      leaf last-changed {
        type yang:date-and-time;
      }

      list alarm {
        key "device type managed-object specific-problem";
        uses common-alarm-parameters;
        leaf is-cleared {
          type boolean;
          mandatory true;
        }

        leaf last-status-change {
          type yang:date-and-time;
          mandatory true;
        }

        leaf last-perceived-severity {
          type severity-t;
        }

        leaf last-alarm-text {
          type alarm-text-t;
        }

        list status-change {
          key event-time;
          min-elements 1;
          uses alarm-state-change-parameters;
        }

        leaf last-alarm-handling-change {
          type yang:date-and-time;
        }

        list alarm-handling {
          key time;
          leaf time {
            tailf:info "Time stamp for operator action";
            type yang:date-and-time;
          }
          leaf state {
            tailf:info "The operators view of the alarm state";
            type alarm-handling-state-t;
            mandatory true;
            description
              "The operators view of the alarm state.";
          }
          ...
        }
        ...
        notification alarm-notification {
        ...
        rpc synchronize-alarms {
        ...
        rpc compress-alarms {
        ...
        rpc purge-alarms {

The first part of the YANG listing above shows the definition for managed-object type in order for alarms to refer to YANG, SNMP and other resources. We also see basic definitions from the X.733 standard for severity levels.

Note well the definition of alarm-type using YANG identities. In this way we can create a structured alarm type hierarchy all rooted at alarm-type. In order for you to add your specific alarm types, define your own alarm types YANG file and add identities using alarm-type as base.

The alarm-model container contains the mapping from alarm types to X.733 parameters used for north-bound interfaces.

The alarm-list container is the actual alarm list where we maintain a list mapping (device, managed-object, alarm-type, specific-problem) to the corresponding alarm state changes [(time, severity, text)].

Finally, we see the northbound alarm notification and alarm administrative actions.

Alarm handling

The NSO alarm manager has support for the operator to acknowledge alarms. We call this alarm handling. Each alarm has an associated list of alarm handling entries as:

container alarms {
  ....
  container alarm-list {
    config false;
    ....
    list alarm {
      key "device type managed-object specific-problem";

      .....

      list alarm-handling {
        key time;
        leaf time {
          type yang:date-and-time;
          description
            "Time-stamp for operator action on alarm.";
        }
        leaf state {
          mandatory true;
          type alarm-handling-state-t;
          description
            "The operators view of the alarm state";
        }
        leaf user {
          description "Which user has acknowledged this alarm";
          mandatory true;
          type string;
        }
        leaf description {
          description "Additional optional textual information regarding
            this new alarm-handling entry";
          type string;
        }
      }

        tailf:action handle-alarm {
          tailf:info "Set the operator state of this alarm";
          description
            "An action to allow the operator to add an entry to the
             alarm-handling list. This is a means for the operator to indicate
             the level of human intervention on an alarm.";
          input {
            leaf state {
              type alarm-handling-state-t;
              mandatory true;
            }
          }
        }
      }

The following typedef defines the different states an alarm can be set into.

Example 40. Alarm state

  typedef alarm-handling-state-t  {
    type enumeration {
      enum none {
        value 1;
      }
      enum ack {
        value 2;
      }
      enum investigation {
        value 3;
      }
      enum observation {
        value 4;
      }
      enum closed {
        value 5;
      }
    }
    description
      "Operator actions on alarms";
  }

It is of course also possible to manipulate the alarm handling list from either Java code or Javascript code running in the web browser using the js_maapi library.

Below follows a simple scenario to illustrate the alarm concepts. The example can be found in examples.ncs/service-provider/simple-mpls-vpn

$ make stop clean all start
$ ncs-netsim stop pe0
$ ncs-netsim stop pe1
$ ncs_cli -u admin -C
admin connected from 127.0.0.1 using console on host
admin@ncs# devices connect
...
connect-result {
    device pe0
    result false
    info Failed to connect to device pe0: connection refused
}
connect-result {
    device pe1
    result false
    info Failed to connect to device pe1: connection refused
}
...
admin@ncs# show alarms alarm-list
alarms alarm-list number-of-alarms 2
alarms alarm-list last-changed 2015-02-18T08:02:49.162436+00:00
alarms alarm-list alarm pe0 connection-failure /devices/device[name='pe0'] ""
 is-cleared              false
 last-status-change      2015-02-18T08:02:49.162734+00:00
 last-perceived-severity major
 last-alarm-text         "Failed to connect to device pe0: connection refused"
 status-change 2015-02-18T08:02:49.162734+00:00
  received-time      2015-02-18T08:02:49.162734+00:00
  perceived-severity major
  alarm-text         "Failed to connect to device pe0: connection refused"
alarms alarm-list alarm pe1 connection-failure /devices/device[name='pe1'] ""
 is-cleared              false
 last-status-change      2015-02-18T08:02:49.162436+00:00
 last-perceived-severity major
 last-alarm-text         "Failed to connect to device pe1: connection refused"
 status-change 2015-02-18T08:02:49.162436+00:00
  received-time      2015-02-18T08:02:49.162436+00:00
  perceived-severity major
  alarm-text         "Failed to connect to device pe1: connection refused"

In the above scenario we stop two of the devices and then ask NSO to connect to all devices. This results in two alarms for pe0 and pe1. Note that the key for the alarm is the devicename, the alarm-type and the full path to the object (in this case the device and not an object within the device) and finally an empty string for specific problem.

In the next command sequence we start the device and request NSO to connect. This will clear the alarms.

admin@ncs# exit
$ ncs-netsim start pe0
DEVICE pe0 OK STARTED
$ ncs-netsim start pe1
DEVICE pe1 OK STARTED
$ ncs_cli -u admin -C
$ admin@ncs# devices connect
...
connect-result {
    device pe0
    result true
    info (admin) Connected to pe0 - 127.0.0.1:10028
}
connect-result {
    device pe1
    result true
    info (admin) Connected to pe1 - 127.0.0.1:10029
}
...
admin@ncs# show alarms alarm-list
alarms alarm-list number-of-alarms 2
alarms alarm-list last-changed 2015-02-18T08:05:04.942637+00:00
alarms alarm-list alarm pe0 connection-failure /devices/device[name='pe0'] ""
 is-cleared              true
 last-status-change      2015-02-18T08:05:04.942637+00:00
 last-perceived-severity major
 last-alarm-text         "Failed to connect to device pe0: connection refused"
 status-change 2015-02-18T08:02:49.162734+00:00
  received-time      2015-02-18T08:02:49.162734+00:00
  perceived-severity major
  alarm-text         "Failed to connect to device pe0: connection refused"
 status-change 2015-02-18T08:05:04.942637+00:00
  received-time      2015-02-18T08:05:04.942637+00:00
  perceived-severity cleared
  alarm-text         "Connected as admin"
alarms alarm-list alarm pe1 connection-failure /devices/device[name='pe1'] ""
 is-cleared              true
 last-status-change      2015-02-18T08:05:04.84115+00:00
 last-perceived-severity major
 last-alarm-text         "Failed to connect to device pe1: connection refused"
 status-change 2015-02-18T08:02:49.162436+00:00
  received-time      2015-02-18T08:02:49.162436+00:00
  perceived-severity major
  alarm-text         "Failed to connect to device pe1: connection refused"
 status-change 2015-02-18T08:05:04.84115+00:00
  received-time      2015-02-18T08:05:04.84115+00:00
  perceived-severity cleared
  alarm-text         "Connected as admin"

Note that there are two status-change entries for the alarm and that the alarm is cleared. In the following scenario we will state that the alarm is closed and finally purge (delete) all alarms that are cleared and closed. (Again note the distinction between operator states and the states from the underlying resources.)

admin@ncs# alarms alarm-list alarm pe0 connection-failure /devices/device[name='pe0']
          "" handle-alarm state closed description Fixed

admin@ncs# show alarms alarm-list alarm alarm-handling

DEVICE  TYPE                 STATE   USER   DESCRIPTION
---------------------------------------------------------
pe0     connection-failure   closed  admin  Fixed

admin@ncs# alarms purge-alarms alarm-handling-state-filter { state closed }
Value for 'alarm-status' [any,cleared,not-cleared]: cleared
purged-alarms 1

Assume you need to configure the northbound parameters. This is done using the alarm-model. A logical mapping of the connection problem above is to map it to X.733 probable cause connectionEstablishmentError (22) . This is done in the NSO CLI in the following way:

admin@ncs# config
Entering configuration mode terminal
admin@ncs(config)# alarms alarm-model alarm-type connection-failure probable-cause 22
admin@ncs(config-alarm-type-connection-failure/*)# commit
Commit complete.
admin@ncs(config-alarm-type-connection-failure/*)# show full-configuration
alarms alarm-model alarm-type connection-failure *
 event-type     communicationsAlarm
 has-clear      true
 kind-of-alarm  root-cause
 probable-cause 22