Why NETCONF & YANG Done Right is Important

Introduction

Combining the strengths of NETCONF and an orchestrator such as Cisco Network Services Orchestrator (NSO) promises a programmable network where services are automatically provisioned by managing the configuration via network wide transactions. In this section, we explore why this is important to Service Providers, what Service Providers are striving towards, what it means in practice to physical and virtual network element providers (NEPs), and, finally, how network element providers can verify that their NETCONF based device meets the Service Automation Criteria and interoperates optimally with NSO.

Network wide transactions are initiated by NSO and extend all the way to the network elements where the configuration is implemented. All of the mandatory and the non-mandatory features of the IETF RFC (Request For Comments) standards based NETCONF protocol interface and YANG data models are implemented. YANG data models describe what the devices can do and are used to automate service provisioning in precise machine to machine transactions. In a network wide transaction, the configuration of all network elements participating in the provisioning of the service either succeeds and is committed, or nothing is set.

Now, let’s investigate a bit further how we can incorporate and make full use of this foundation in virtual and physical network elements in the context of NSO and a NETCONF server such as ConfD.

Why do Service Providers Want NETCONF and YANG?

Science and technology are progressing at an increasingly rapid pace, and what was new yesterday is obsolete and provides little value today or tomorrow. Beyond that general trend, why should a Service Provider even bother to jump on some new way of managing their networks? What are the benefits and opportunities? Or is it just about trying to sustain the current value provided to customers?

The key market trends seem to be tied to, or at least influenced by, the previously mentioned overall direction of humanity. This manifests itself for Service Providers in at least three ways:

Execution at the speed of software – Software is playing an increasingly important role for services and product demand today. Take, for example, the alarm clock. The first adjustable mechanical alarm clock was invented around 1850. Today, the alarm clock is, for most people, a software based app on general purpose hardware in a smartphone with many of its settings stored in a cloud based service, set through an AI assistant such as Siri, Alexa, Cortana, etc. Developing new service features and modifying existing ones with an agile DevOps software development methodology, and deploying them to a cloud platform, is already a firm requirement.
Changing customer behavior and new expectations – Not that long ago, ordering a new alarm clock or a newly released book was something that you went to the physical store to buy or order. Today, services and products are distributed on demand with the press of a button and can often be accessed instantaneously.
Rapidly changing business models – Remember when transactions occurred by exchanging products and services for money in the form of bills and coins? Money, arguably one of humanity's greatest inventions, is nowadays exchanged electronically. As services are provided as cloud services through virtualization and programmable networks, new ecosystems and value chains are created. Over-The-Top Co-opetition is competing with the provider's own service offering while at the same time adding value to the infrastructure service.

All of the above trends require successful, flexible automation to address two main business drivers:

Automated operations to reduce OPEX (Operating Expenses)
Service agility to reduce the time to complete the deployment of a service

Hence, to spend less and bring in more, i.e. add more of those precious numbers after, for example, the "$" or before the "SEK", we need to make some adjustments to our efficiency of delivering services and deliver them when they are most relevant and of maximum value to the consumer.

But the complexity of automating existing legacy equipment and network management infrastructure has made many automation initiatives a painful challenge. Without automation, the day-to-day management of rapidly growing, complex networks is carried out by network engineers through a mix of scripted and manual steps. This is error-prone and leaves a growing backlog. Services are provisioned and service quality is managed in the dark, without insight into the status of the service, and, as a result, the customer experience suffers.

The growing complexity of networks needs to be addressed through a common network API that interfaces to all network elements, and services need to be abstracted through a common central API.

The key to network automation is making networking devices more programmable. Programmability is about configuring and managing network devices using software machine-to-machine, as opposed to manual human-to-machine tinkering via CLI. Programmability is key to managing and automating large-scale networks with network elements from multiple vendors, running multiple different operating systems, and even variants of an OS within each vendor domain. CLIs are cumbersome to use, error-prone, and vary widely across different vendors and even between products from the same vendor. With programmability, network management and automation become an exercise in computer science, where network administrators manipulate data, not devices.

To address the network element configuration challenge, the Internet Engineering Task Force (IETF) has defined three key standards:

The NETCONF and RESTCONF APIs that make it easier for service providers to manage their multivendor networks.
A data modeling language, YANG, which is key to defining devices and services in a consistent, parsable manner.

One of the key advantages of NETCONF over SNMP or CLI is that it is transactional. Configuring a network element can involve multiple steps. Usually, these actions cannot be done partially as this would leave the device in an undefined state. If any step of the end-to-end provisioning fails (or is undertaken in the wrong order), there is a need to roll back, i.e., undo all previous actions, to revert to the original configuration. This requires extensive programming when transaction management is not supported, as is the case with a CLI. NETCONF, on the other hand, does support transactions. Service providers can be confident that either all the configurations in a sequence are applied or the entire update is rolled back automatically. Additionally, transactionality at the network element level makes it easier to implement network-wide transactions which involve multiple network elements when provisioning services across a network.
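The all-or-nothing behavior described above can be illustrated with a minimal sketch. It is not how any particular NETCONF server is implemented; the flat path-to-value configuration, the `validate` rule, and the `apply_transaction` helper are all hypothetical names chosen for illustration. The point is that every change is applied to a candidate copy first, so a failure leaves the running configuration untouched and there is nothing to undo by hand.

```python
class ValidationError(Exception):
    """Raised when a candidate configuration breaks a validation rule."""


DELETE = object()  # marker value meaning "remove this path"


def validate(config):
    # Hypothetical rule, for illustration only: any MTU must be in 68..9216.
    for path, value in config.items():
        if path.endswith("/mtu") and not 68 <= value <= 9216:
            raise ValidationError(f"{path}: {value} is out of range")


def apply_transaction(running, changes):
    """Return the new running configuration, or raise without side effects.

    All changes are applied to a copy (the candidate). Only if the whole
    candidate validates is it returned as the new running configuration.
    """
    candidate = dict(running)
    for path, value in changes.items():
        if value is DELETE:
            candidate.pop(path, None)
        else:
            candidate[path] = value
    validate(candidate)
    return candidate


running = {"interfaces/eth0/mtu": 1500}
try:
    # One bad value causes the whole set of changes to be rejected.
    running = apply_transaction(running, {
        "interfaces/eth0/description": "uplink",
        "interfaces/eth0/mtu": 20,
    })
except ValidationError as err:
    print("rejected:", err)
print(running)
```

With a CLI, the equivalent would require the client to remember each command it issued and to issue the inverse commands in the right order on failure.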

We will not go into the details of RESTCONF in this document. RESTCONF, as well as when RESTCONF may be an alternative to NETCONF, is covered in a previously released white paper called Inside RESTCONF.

YANG data models which describe the configuration and state information of network elements and services are transported via the NETCONF protocol. The configuration is plain text and human readable/writable. YANG data models are easily parsed by a computer, unlike CLI. This is key to enabling programmability.
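To make the "plain text, yet machine parsable" point concrete, here is a small sketch that builds a NETCONF edit-config request with Python's standard library and parses it back. The NETCONF base namespace is the one defined by RFC 6241; the `http://example.com/ns/interfaces` namespace and its `interfaces/interface/name/mtu` structure are a made-up stand-in for whatever YANG data model a device actually implements.

```python
import xml.etree.ElementTree as ET

NC_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"   # NETCONF base namespace
EX_NS = "http://example.com/ns/interfaces"          # hypothetical YANG module namespace


def build_edit_config(message_id, if_name, mtu):
    """Build an <edit-config> RPC that targets the candidate datastore."""
    rpc = ET.Element(f"{{{NC_NS}}}rpc", {"message-id": str(message_id)})
    edit = ET.SubElement(rpc, f"{{{NC_NS}}}edit-config")
    target = ET.SubElement(edit, f"{{{NC_NS}}}target")
    ET.SubElement(target, f"{{{NC_NS}}}candidate")
    config = ET.SubElement(edit, f"{{{NC_NS}}}config")

    # Payload structured according to the (hypothetical) data model.
    interfaces = ET.SubElement(config, f"{{{EX_NS}}}interfaces")
    interface = ET.SubElement(interfaces, f"{{{EX_NS}}}interface")
    ET.SubElement(interface, f"{{{EX_NS}}}name").text = if_name
    ET.SubElement(interface, f"{{{EX_NS}}}mtu").text = str(mtu)
    return ET.tostring(rpc, encoding="unicode")


request = build_edit_config(101, "eth0", 9000)
print(request)

# The same text is trivially parsed back into structured data.
parsed = ET.fromstring(request)
print(parsed.find(f".//{{{EX_NS}}}mtu").text)
```

No screen scraping or prompt matching is involved: both sides agree on the structure through the data model, which is what a CLI cannot offer.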

The NSO Network Element Driver (NED)

The majority of existing devices in current networks do not (yet) speak NETCONF. The most common way to configure network devices is through the CLI. Management systems typically connect over SSH to the CLI of the device and issue a series of CLI configuration commands. Some devices do not even have a CLI, and thus SNMP, or, even worse, a proprietary protocol is used to configure the device.

If your network element already provides a CLI through a legacy integration or ConfD, why would you need to support NETCONF if NSO can adapt to your CLI by means of a NED, i.e. an adapter tailored to write to and read from the most relevant parts of your CLI?

While NSO can speak southbound to devices which do not support NETCONF, this is not entirely automatic like it is with NETCONF. Depending on the type of interface the device has for configuration, this will involve some programming of a NED that will then need to be maintained over time.

No NETCONF and no shared YANG data model means that there is NED code to maintain in an attempt to make an error prone human-to-machine interface serve as a machine-to-machine interface in a best effort manner. This error prone human part is what the Service Provider wants to move away from in order to fully automate operations to reduce OPEX and provide service agility on top of the device management to reduce time to complete the deployment of a service.

Can I Just Enable ConfD's NETCONF Interface and be Done?

While you can just write or integrate a NETCONF server such as ConfD, there are a few things that you need to keep in mind when you configure and integrate it. You want to follow best practices so that your device supports machine-to-machine communication and automated operations, is manageable by an orchestrator using a NETCONF client such as NSO, and takes part in enabling the Service Provider to deploy services in network wide transactions.

Service Automation Criteria (SAC)

The Service Automation Criteria (SAC) was developed as a guide to NETCONF and YANG best practices. For your device to be useful at all to a Service Provider hoping to automate their network using NETCONF and YANG, you must support at the very least:

NETCONF Base Operations – The NETCONF base protocol must be fully supported or it isn’t NETCONF. The base protocol provides operations to retrieve, configure, copy, and delete configuration datastores. Devices may support additional operations which they will advertise as capabilities in the NETCONF "hello" exchange. Devices, whether physical or virtual, must fully implement the NETCONF 1.1 (RFC 6241) :netconf:base:1.1 capability in order to properly support NETCONF interoperability and multi-vendor orchestration.
The base operations are: get, get-config, edit-config, copy-config, delete-config, lock, unlock, close-session, and kill-session. To lead by example as model citizens of the NETCONF and YANG community, ConfD and NSO support the mandatory and almost all of the non-mandatory parts of NETCONF RFC 6241 and YANG RFC 7950. Unfortunately, we have seen more than one NETCONF server implementation that fails to implement all of the core base operations, for example by not supporting the lock operation.
Transactional NETCONF – To take part in a fully automated network based on machine-to-machine communication you need to support ACID (Atomicity, Consistency, Isolation, and Durability) transactions. Supporting a two-phase commit mechanism through :rollback-on-error, :candidate, :confirmed-commit:1.1 and :validate:1.1 is a must.
Consistent edit-config – These simple rules are all a no brainer to you if you are on-board with standards-based network automation through NETCONF and YANG based machine-to-machine communication:
1. All configuration data must be editable through a NETCONF operation.
2. Proprietary NETCONF RPCs that make configuration changes are not something you want your Service Provider customer to have to implement with NSO or some other NETCONF client. If you believe that you really need one, does it really make sense from a Service Provider/NETCONF client/orchestrator (NSO) perspective?
3. An edit-config operation must change the configuration in accordance with the payload or fail with an error message.
  1. In case of failure, there must not be any change to the configuration on the device.
  2. In case of success, all of the payload changes must be implemented and there must not be any other changes of the configuration other than the ones prescribed in the edit-config payload. If the device makes other changes itself, then the controller/orchestrator will be out-of-sync and will not know that it's out-of-sync. This will lead to errors down the line. Devices which self-modify their own configuration are a very bad idea from the viewpoint of automation and orchestration.
    See and study the transaction manager presented in the next section of this document.
4. It must be possible to go from any valid configuration to any other valid configuration with a single edit-config operation containing only the minimal delta between the two. NSO does this by calculating the minimal diff between the old and the new configuration and sending only that delta. Not all NETCONF clients (managers) do this, and conflicting commands sent to the NETCONF server in the same transaction can lead to strange behavior, which is why NSO always sends only the minimal diff to devices. This is also why ConfD provides the minimal diff to applications subscribing to configuration changes. The validity of an upcoming configuration must depend on that configuration alone. Specifically, the validity must not depend on the currently running configuration, the presence of hardware, the phase of the moon, or any other condition that is not part of the configuration. Service Providers want to be able to (quickly!) load backups or other saved or computed configurations. This must "work" even if a line card has gone bad. ConfD’s validation points allow you to do any validation in code that you cannot fit into a YANG pattern, range, must expression etc. Implement your validation point code with the above in mind.
5. A minimal delta must only refer to any particular leaf once, i.e., it cannot first set a leaf to one value and then set it to another value. A transaction is a set of changes, not a sequence. Concepts such as first, then, last, before, and after are meaningless inside a transaction. A transaction can only give each leaf a single value. Devices will need to properly sequence the work to reach the desired configuration and cannot rely on operators to specify this order. Use priorities when registering for receiving configuration change events through ConfD’s CDB (Configuration DataBase) subscriber API.
NETCONF over SSH – No surprise here. The NETCONF RFCs require that NETCONF over SSH must be supported. Other transport protocols for NETCONF do exist, but are optional per the RFCs. Common industry practice is to use NETCONF over SSH and that is what ConfD and NSO use.
Defaults handling – The :with-defaults capability from RFC6243 must be implemented. Declaring how the device treats default values is essential for mutual understanding of what is being said between an orchestrator and device.
Standard models – Applicable standard YANG data models, such as IETF or OpenConfig, must be implemented. Using standardized YANG data models, possibly extending them with more functionality and control, is key to increased interoperability and lower Service Provider OPEX. OpenConfig YANG data models are a good thing for the same reasons as IETF standard models. But make sure that you do not expose, or have NSO import, YANG models with overlapping configuration. Choose one YANG data model or the other to represent a given piece of configuration, such as the interface configuration. Otherwise, you will break Service Automation Criteria #2 "Consistent edit-config", where an operation must change the configuration in accordance with the payload, i.e. there must not be any changes to the configuration other than the ones prescribed in the edit-config payload. If the device itself makes other changes, the orchestrator/controller will not know about them and will get out of sync. The orchestrator/controller will likely not know that it's out of sync until the next transaction. This will lead to errors down the line.
Model Discovery – YANG data model discovery and download as defined in RFC6022 should be implemented. Being able to download YANG data models directly from the device makes it easier to get the right version of the models and may enable completely automatic device discovery. If implemented, NSO can retrieve the YANG data models directly from a ConfD enabled network element. All downloaded YANG data models must be free from YANG syntax errors, of course, and by implementing them with ConfD it is unlikely that NSO will have issues with those same YANG data models. See the section Building and Installing a NETCONF NED using the NETCONF NED Builder later in this document.
Events – NETCONF Event Notifications (RFC5277) are reliable and very informative to the manager about what is going on in the device. For example, using the built-in NETCONF notification event stream (RFC6022), NSO can get an event / alarm that the managed device configuration has been changed out-of-band and that NSO is now out of sync. ConfD has built-in support for RFCs 6022 and 5277. So, the NETCONF notification event stream is provided automatically as the NETCONF interface is enabled. The application can also, through ConfD, provide its own proprietary event notifications to the orchestrator/controller. The application generates the content for each notification and sends it to ConfD. ConfD, in turn, manages the stream subscriptions and distributes the notifications accordingly.
YANG – Use YANG to define configuration, operational status data, notifications, and actions according to RFC7950. Keep NETCONF in mind primarily when writing your YANG data models. ConfD automatically renders all the northbound management interfaces from these YANG data models. A default rendering of each interface is produced automatically, without any programming at all, solely based on the model. If a YANG data model is updated, the change is automatically reflected in all management interfaces. This is a great feature but at the same time can cause a conflict of interest.
Avoid being heavily influenced to structure your YANG data models around a legacy CLI interface. Service Providers have a responsibility here too to not let the CLI scripting team dictate the implementation of the YANG data models and push aside the need to design for service automation and efficient machine-to-machine communication. Think machine-to-machine, not human-to-machine communication, if you want to contribute to a successful programmable network.
YANG Backwards Compatibility – The YANG data model upgrade rules defined in RFC7950 section 11 should be followed. Again, think machine-to-machine, not human-to-machine communication. All deviations from the section 11 rules must be handled by a built-in automatic upgrade mechanism. You get help and APIs from both ConfD and NSO to handle this, but sticking to section 11 is of course your best option when you have deployed to a live network.
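Several of the criteria above are visible directly in the capabilities a device advertises in its NETCONF "hello" message. The following is a minimal sketch of such a check using only the Python standard library. The capability URNs are the ones defined in RFC 6241 and RFC 6243; the `REQUIRED` table, the `missing_capabilities` helper, and the sample hello message are illustrative assumptions, not part of NSO, ConfD, or any official SAC test tool. Advertising a capability does not prove it is implemented correctly, so this is only a first sanity check.

```python
import xml.etree.ElementTree as ET

NC_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"
CAP = "urn:ietf:params:netconf:capability:"

# Capabilities named in the criteria above (illustrative selection).
REQUIRED = {
    ":base:1.1": "urn:ietf:params:netconf:base:1.1",
    ":candidate": CAP + "candidate:1.0",
    ":confirmed-commit:1.1": CAP + "confirmed-commit:1.1",
    ":validate:1.1": CAP + "validate:1.1",
    ":rollback-on-error": CAP + "rollback-on-error:1.0",
    ":with-defaults": CAP + "with-defaults:1.0",
}


def missing_capabilities(hello_xml):
    """Return the short names of required capabilities absent from a hello."""
    root = ET.fromstring(hello_xml)
    advertised = {
        # Drop URI parameters such as "?basic-mode=explicit".
        (cap.text or "").strip().split("?", 1)[0]
        for cap in root.iter(f"{{{NC_NS}}}capability")
    }
    return sorted(name for name, urn in REQUIRED.items() if urn not in advertised)


# Made-up hello from a device that does not advertise :validate:1.1.
SAMPLE_HELLO = f"""
<hello xmlns="{NC_NS}">
  <capabilities>
    <capability>urn:ietf:params:netconf:base:1.1</capability>
    <capability>{CAP}candidate:1.0</capability>
    <capability>{CAP}confirmed-commit:1.1</capability>
    <capability>{CAP}rollback-on-error:1.0</capability>
    <capability>{CAP}with-defaults:1.0?basic-mode=explicit</capability>
  </capabilities>
  <session-id>4</session-id>
</hello>
"""

print(missing_capabilities(SAMPLE_HELLO))
```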

Network Wide Transactions

We have mentioned transactions and two-phase commits a few times already in this document. Because transactions of configuration data sit at the heart of and drive the automated intuitive network based on machine-to-machine communication, we will now dive into transactions, give you some background, and then visualize how the NSO and ConfD transaction managers collaborate. The intent here is to give an overview of how to make an application best integrated with ConfD to participate in network wide transactions resulting from services being deployed through NSO.

Transactions are useful things. They allow us to say that these configuration updates either all happen together, or none of them happen. They are very useful when we are inserting configuration data into an NSO/ConfD CDB database. They let us update multiple tables/lists, leafs, etc. at once, knowing that if anything fails, everything gets rolled back, ensuring our data doesn’t get into an inconsistent state. Simply put, a transaction allows us to group together multiple different activities that take our system from one consistent state to another: everything works or nothing changes.

ConfD handles transactions through its transaction manager. One by one, the NETCONF edit-config transactions go through the transaction phases, as we will see in the diagram below, and the applications subscribing to configuration updates are notified according to the priorities that they registered with ConfD.

The NSO transaction manager is similar to ConfD’s, but instead of notifying applications of changes done by the manager, NSO issues network wide transactions to one or multiple network elements as part of deploying a service.

Network wide transactions span multiple device transactions within them, using the transaction manager to orchestrate the various transactions being done by underlying systems, e.g., ConfD enabled systems, each with their own transaction manager. Just as with a transaction to ConfD’s CDB, a network wide transaction tries to ensure that everything remains in a consistent state. However, in this case, it tries to do so across multiple different systems running in different processes, often communicating across network boundaries.

The most common algorithm for handling network wide distributed transactions is to use a two-phase commit. NSO and ConfD both implement this algorithm using the NETCONF :rollback-on-error, :candidate, :confirmed-commit:1.1 and :validate:1.1 capabilities. With a two-phase commit, first comes the NSO validate and then the prepare phase. This is where each participant (network element) in the network wide transaction tells the NSO transaction manager whether it thinks its local transaction can go ahead. If the transaction manager gets an OK from all participants, then it tells them all to go ahead and perform their commits.

A single abort is enough for the NSO transaction manager to send out a cancel, i.e. rollback, to all parties. This approach relies on all parties halting until the central coordinating process in NSO tells them to proceed. This means that we are vulnerable to outages which is where the ConfD transaction manager comes into play. If the NSO transaction manager goes down, the pending transactions never complete, but will be aborted and rolled back automatically by ConfD's transaction manager. The backend applications subscribing to configuration changes are never notified.
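The prepare, commit, and abort flow described above can be sketched as a toy coordinator. This is a simplified model of two-phase commit for illustration only; the `Device` class, its `accepts` flag, and `network_wide_commit` are hypothetical and do not reflect the actual NSO or ConfD transaction manager APIs, nor the NETCONF messages exchanged on the wire.

```python
class Device:
    """Toy participant with a running config and a pending (prepared) config."""

    def __init__(self, name, accepts=True):
        self.name = name
        self.accepts = accepts      # stand-in for local validation succeeding
        self.running = {}
        self.pending = None

    def prepare(self, changes):
        """Phase 1: say whether the local transaction can go ahead."""
        if not self.accepts:
            return False
        self.pending = {**self.running, **changes}
        return True

    def commit(self):
        """Phase 2: make the prepared configuration the running one."""
        self.running, self.pending = self.pending, None

    def abort(self):
        """Drop the prepared configuration; running is untouched."""
        self.pending = None


def network_wide_commit(plan):
    """plan is a list of (device, changes). Returns True if all committed."""
    prepared = []
    for device, changes in plan:
        if device.prepare(changes):
            prepared.append(device)
        else:
            # A single refusal is enough: cancel everyone prepared so far.
            for participant in prepared:
                participant.abort()
            return False
    for device in prepared:
        device.commit()
    return True


pe1, pe2 = Device("pe1"), Device("pe2", accepts=False)
ok = network_wide_commit([(pe1, {"vrf": "blue"}), (pe2, {"vrf": "blue"})])
print(ok, pe1.running, pe2.running)
```

Because `pe2` refuses in the prepare phase, `pe1` never commits, and both devices keep their original configuration.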

There is also the case of what happens if a commit fails after the devices have returned ok, i.e. a "CDB sync subscription", following a confirmed-commit where the applications have been notified of the configuration changes. This is where activating the previously stored checkpoint saved by ConfD after a confirmed-commit saves the day. If NSO decides to abort the commit or does not confirm the commit within an (NSO controlled) time limit, the ConfD enabled device will roll back the CDB running datastore automatically and provide the undo configuration to the applications subscribing to CDB configuration changes.
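The checkpoint-and-timeout behavior of a confirmed commit can be modeled in a few lines. This is a conceptual sketch only: the `ConfirmedCommitDevice` class and its methods are hypothetical, time is passed in explicitly as a number so the example is deterministic, and the 600 second default mirrors the default confirm-timeout in RFC 6241 rather than any NSO setting.

```python
class ConfirmedCommitDevice:
    """Toy device that reverts to a checkpoint if a commit is not confirmed."""

    def __init__(self, running):
        self.running = dict(running)
        self.checkpoint = None       # configuration to restore on rollback
        self.deadline = None
        self.subscriber_log = []     # what subscribing applications were told

    def confirmed_commit(self, new_config, now, timeout=600):
        self.checkpoint = dict(self.running)
        self.running = dict(new_config)
        self.deadline = now + timeout
        self.subscriber_log.append(("apply", dict(self.running)))

    def confirm(self, now):
        """Make the commit permanent, unless the timeout already fired."""
        self.expire(now)
        if self.checkpoint is None:
            return False
        self.checkpoint = None
        self.deadline = None
        return True

    def cancel(self):
        """Explicit abort from the manager."""
        self._restore()

    def expire(self, now):
        """Called periodically; rolls back if the deadline has passed."""
        if self.checkpoint is not None and now >= self.deadline:
            self._restore()

    def _restore(self):
        if self.checkpoint is None:
            return
        self.running = self.checkpoint
        self.checkpoint = None
        self.deadline = None
        # Subscribers receive the undo so they can revert their own state.
        self.subscriber_log.append(("undo", dict(self.running)))


device = ConfirmedCommitDevice({"mtu": 1500})
device.confirmed_commit({"mtu": 9000}, now=0)
device.expire(now=600)               # the manager never confirmed
print(device.running, device.subscriber_log)
```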

Now that you have gained a basic understanding of how network wide transactions work and why they are important to automated service deployment, you can, from this perspective, integrate ConfD with your applications to fulfill SAC items 1, 2, and 3 with a ConfD and NSO setup.
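The "Consistent edit-config" criterion relies on a transaction being a set of changes, with each leaf mentioned at most once. A minimal sketch of computing such a minimal delta between two configurations follows. It models a configuration as a flat dictionary from path to value, which is a simplification of a real YANG data tree, and `minimal_delta`, `apply_delta`, and `DELETE` are hypothetical names; this is not NSO's actual diff algorithm.

```python
DELETE = object()  # marker value meaning "this path must be removed"


def minimal_delta(old, new):
    """Return {path: value} describing only what differs between old and new.

    Being a dictionary, the result can mention each path only once, so it is
    a set of changes rather than an ordered sequence of commands.
    """
    delta = {}
    for path in old.keys() | new.keys():
        if path not in new:
            delta[path] = DELETE
        elif path not in old or old[path] != new[path]:
            delta[path] = new[path]
    return delta


def apply_delta(config, delta):
    """Apply a delta to a configuration and return the resulting configuration."""
    result = dict(config)
    for path, value in delta.items():
        if value is DELETE:
            result.pop(path, None)
        else:
            result[path] = value
    return result


old = {"eth0/mtu": 1500, "eth0/description": "uplink", "eth1/mtu": 1500}
new = {"eth0/mtu": 9000, "eth0/description": "uplink", "eth2/mtu": 1500}

delta = minimal_delta(old, new)
print(len(delta))                        # unchanged leafs are not included
print(apply_delta(old, delta) == new)    # one delta takes old to new
```

The unchanged `eth0/description` leaf is absent from the delta, and applying the delta in any order yields the same result, which is why the device, not the operator, must work out the sequencing.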

Out-of-Band Changes

An out-of-band change is a human or system making changes directly to the network element without going through the orchestrator/controller. If you want your device to be deployed in a programmable network with fully automated service deployment, your goal must be to enable the Service Provider to allow preferably no out-of-band changes, or at least only controlled out-of-band changes. Uncontrolled out-of-band changes need to be prohibited in a programmable network where service deployment is fully automated.

No out-of-band changes – The orchestrator/controller is the single point of configuration authority for your network element. Your device being out-of-sync is an exception and considered an alarming event that will need Service Provider intervention. If you design for this environment, you allow for the highest automation maturity and the highest network configuration authenticity.
Controlled out-of-band changes – Some out-of-band changes, but always unrelated to the NSO (Service Provider) service. Configuration being provisioned through NSO writes (or reads) certain parts of the device configuration. Another system or human is configuring the same network element but writes to different parts of the configuration. The configuration changes performed out-of-band are known and will not change configuration that is written or read by NSO. NSO being out-of-sync with the device is accepted, since out-of-band changes are safe. NSO is configured to skip the sync check during service provisioning.
Uncontrolled out-of-band changes – Other systems and humans are performing unknown out-of-band changes to the same network element as NSO and overwrite configuration that an NSO service wants to write or has written. Uncontrolled out-of-band changes are considered an operational failure/incident and the service provider needs to use tools in NSO to identify and understand the root cause. Either a human or code needs to analyze what has happened, take a decision and perform some kind of error handling. The decision can only be made by someone/something with a full understanding of the use case and the engineering/ops policies in place.
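The distinction between the three cases can be expressed as a simple overlap test between the configuration paths owned by the orchestrator and the paths touched out-of-band. The sketch below is illustrative only: the path strings, the `classify_out_of_band` helper, and the idea of a list of orchestrator-managed subtrees are assumptions, and NSO's own sync checking works differently. It also captures only the overlap aspect; whether a change was "known" in advance is an operational policy question that code like this cannot answer.

```python
def classify_out_of_band(managed_subtrees, changed_paths):
    """Classify out-of-band changes against orchestrator-managed subtrees.

    managed_subtrees: path prefixes the orchestrator writes or reads.
    changed_paths:    paths that were modified without the orchestrator.
    """
    if not changed_paths:
        return "none"
    for path in changed_paths:
        for subtree in managed_subtrees:
            # A change is in conflict if it is the subtree root or below it.
            if path == subtree or path.startswith(subtree + "/"):
                return "uncontrolled"
    return "controlled"


managed = ["/interfaces/eth0", "/vrf/blue"]

print(classify_out_of_band(managed, []))
print(classify_out_of_band(managed, ["/snmp/community"]))
print(classify_out_of_band(managed, ["/snmp/community", "/interfaces/eth0/mtu"]))
```

Note the `subtree + "/"` check: a change to `/interfaces/eth00` is not treated as falling under `/interfaces/eth0`.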

Out-of-band changes are one of the things that, if not approached correctly, can have big effects on the performance and behavior of the entire solution. Are you stuck trying to allow the Service Provider to automate manual tasks, or are you designing for automation from the ground up?