Cisco Observability Platform

This documentation and the Cisco Observability Platform functionalities it describes are subject to change. Data saved on the platform may disappear and APIs may change without notice.

Introduction

Health rules allow you to specify conditions to monitor the health of entities defined in your solutions. The entities are resources that you need to monitor to determine the health of your solution. Health refers to the overall performance status (such as good, critical, warning, or unknown) of an entity or group of entities. For example, if you want to monitor the health of your Kubernetes® (K8s) environment, you can track the health of entities such as clusters, namespaces, workloads, and pods. You can use health rules to detect whether any metrics of the entities deviate from normal behavior and to trigger alerts for those deviations.

As a solution developer, you can create health rules for your business entities based on the health rule schema. For example, if you develop a custom monitoring solution for the K8s infrastructure in your organization, you can create health rules for monitoring the health of K8s entities such as clusters, namespaces, workloads, and pods. When users subscribe to the K8s solution in their tenant, the health rules are applied to that tenant and the users receive alerts based on these health rules.

Key Concepts

You are required to understand the following concepts before creating health rules for your solution:

Entity Types and Entities

You need to specify the entity types that a health rule monitors. Entities are instances of entity types. For example, k8s:pod (Kubernetes pod) is an entity type, while o2-k8s-monitoring-otel-collector-lg29c (a Kubernetes pod instance) is an instance of k8s:pod. You define a health rule for an entity type, but the health rule is evaluated at the entity level.

Health Rule Violation

A health violation occurs when the performance of an entity being monitored by the health rule violates the conditions set in the rule. The health statuses are represented as critical, warning, normal, not available (NA), and unknown.

A health violation event occurs when the health status of an entity changes. A few examples of health rule violation events are:

Violation Started: Warning
Violation Started: Critical
Violation Upgraded: Warning to Critical
Violation Downgraded: Critical to Warning
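
As an illustration only, the mapping from a status change to the event names listed above could be sketched as follows. The function name `violation_event` and the severity ranking are assumptions made for this example and are not part of the platform; a return to normal is not covered because the examples above do not name that event.

```python
# Hypothetical sketch: derive an event label from a health status change.
# The severity ranking is an assumption made for illustration.
SEVERITY = {"normal": 0, "warning": 1, "critical": 2}


def violation_event(previous: str, current: str):
    """Return an event label for a status change, or None if not covered."""
    prev_rank, curr_rank = SEVERITY[previous], SEVERITY[current]
    if prev_rank == curr_rank or curr_rank == 0:
        # No change, or a return to normal (not covered by this sketch).
        return None
    if prev_rank == 0:
        return f"Violation Started: {current.capitalize()}"
    direction = "Upgraded" if curr_rank > prev_rank else "Downgraded"
    return f"Violation {direction}: {previous.capitalize()} to {current.capitalize()}"


print(violation_event("warning", "critical"))
```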

Health Evaluation

The performance or the health of the entity is evaluated at:

The individual entity level (granular): the alerts are triggered based on the performance of a single entity, for example, a service instance.
Parent entity level (aggregation of a group of entities): the alerts are triggered based on the aggregate performance of a group of entities, such as service instances grouped by the parent service. The aggregate performance is calculated based on the metric you select. Alerts are only triggered if the performance of all the grouped entities deteriorates, for example, when the performance of all the service instances within the service deteriorates.

Health Rule Wait Time After Violation

The wait time after violation enables you to control how often a violation is generated while the conditions found to violate a health rule continue. If the health rule is violated, with a status of either Critical or Warning, a Violation Open: Critical or Violation Open: Warning event is generated. This event is used to initiate any required actions.

Once an Open event has occurred, the status of the health rule is evaluated every minute. If the same violation is detected, the violation remains open with the same status. A corresponding Violation Continues: Critical or Violation Continues: Warning event may be generated.

A Violation Continues event every minute might be too noisy for your health rule. The waitTimeAfterViolation field in the health rule schema is used to throttle how often these Continues events are generated for continuing health rule violations. The default is every 30 minutes.
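
The throttling effect can be pictured with a small sketch. It assumes that the first Continues event is emitted one full wait interval after the Open event and then once per interval; the platform's exact timing may differ. The function name `continues_event_minutes` is hypothetical.

```python
# Hypothetical sketch of how waitTimeAfterViolation throttles Continues events.
# Assumption: the first Continues event is emitted one wait interval after Open.
def continues_event_minutes(violation_minutes: int, wait_time_after_violation: int = 30):
    """Return the minutes after the Open event at which Continues events fire."""
    if wait_time_after_violation <= 0:
        raise ValueError("wait_time_after_violation must be positive")
    return list(
        range(wait_time_after_violation, violation_minutes + 1, wait_time_after_violation)
    )


# A violation lasting 90 minutes with the default wait time of 30 minutes.
print(continues_event_minutes(90))
# The same violation with a wait time of 1 minute produces one event per minute.
print(len(continues_event_minutes(90, wait_time_after_violation=1)))
```

Under these assumptions, a longer wait time reduces the number of Continues events without changing when the violation opens.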

Health Rollup

You can aggregate (rollup) the health of the entities to define the health of a group of entities. This means a child entity can define the health of a parent entity by rollup relationship. You can define the health rollup relationship between the entities and the parents using the rollupTo field in the health rule schema.

For example, if 40% of the k8s:namespaces are unhealthy, the k8s:cluster health is reported as unhealthy.
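
The logic of this example can be sketched in Python. This is not the rollup expression syntax used in the rollupTo field; it only illustrates the idea. The function name, the choice to count both warning and critical children as unhealthy, and the inclusive 40% comparison are assumptions.

```python
# Hypothetical sketch of a percentage-based health rollup.
# Assumptions: "warning" and "critical" both count as unhealthy, and the
# parent is unhealthy when the unhealthy share reaches the given percentage.
def rolled_up_health(child_statuses, unhealthy_percent: int = 40) -> str:
    """Return the parent's health given the statuses of its children."""
    if not child_statuses:
        return "unknown"
    unhealthy = sum(1 for s in child_statuses if s in ("warning", "critical"))
    # Integer arithmetic avoids floating-point surprises at the boundary.
    if unhealthy * 100 >= unhealthy_percent * len(child_statuses):
        return "unhealthy"
    return "healthy"


namespaces = ["critical", "warning", "normal", "normal", "normal"]
print(rolled_up_health(namespaces))  # 2 of 5 namespaces (40%) are unhealthy
```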

For more information about the rollup expressions, see Expression Language.

Health Rule Evaluation Conditions

A health rule condition is an acceptable performance range for an identified metric. A condition defines the metric levels that constitute a warning status or a critical status.

A condition consists of a boolean statement that compares the current value of a metric against one or more static or dynamic thresholds based on a selected baseline. If the condition is true, the health rule is violated. You can configure the rules for evaluating a condition using multiple thresholds.

Static thresholds are straightforward. For example, is the Memory Utilization for a pod greater than 80%?

If the Memory Utilization is greater than 80%, the condition evaluates to true and the health rule is violated. You can also select the source from which you want to query the data. The health evaluation varies depending on the data source you choose because metrics from different sources have different granularity and properties.

Dynamic thresholds are based on a percentage in relation to, or a standard deviation from, a baseline built on a rolled-up baseline trend pattern.

You can define a threshold for a health rule based on a single metric value or a mathematical expression built from multiple metric values.
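
The difference between static and dynamic thresholds can be sketched as follows. The baseline here is a plain mean of recent values with a population standard deviation, which is a simplified stand-in for the platform's baseline trend pattern; the function names and the default of two standard deviations are assumptions.

```python
from statistics import mean, pstdev


# Hypothetical sketch of a static threshold check.
def violates_static(value: float, threshold: float) -> bool:
    """True when the metric value exceeds a fixed threshold."""
    return value > threshold


# Hypothetical sketch of a dynamic threshold check.
# Assumption: the baseline is a simple mean of historical values.
def violates_baseline(value: float, history, num_std_devs: float = 2.0) -> bool:
    """True when the value exceeds the baseline by the given number of standard deviations."""
    baseline = mean(history)
    spread = pstdev(history)
    return value > baseline + num_std_devs * spread


print(violates_static(85.0, 80.0))                   # Memory Utilization above 80%
print(violates_baseline(60.0, [50, 52, 48, 50]))     # far above the recent baseline
```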

The following are some examples of health rule conditions:

To know if there are pods with readiness/liveness issues affecting your services, define a condition:

Readiness probe status = 0 for 80% of pods in a workload
Liveness probe status = 0 for more than 30% of pods in a workload

To know if any services are impacted by pod restarts, define a condition:

Pod Restarts are greater than 3 for 80% of pods in a workload

To know about failed or pending pods, define a condition:

Sum of Failed pods over a workload is greater than 10%
Sum of Pending pods over a workload is greater than 10%

The value of Errors per Minute / Calls per Minute over the last 15 days is greater than 0.2. This example combines two metrics in a single condition.

(Average response time > baseline OR errors per minute > baseline) AND (calls per minute > the defined threshold). This example uses multiple conditions to evaluate the health rule.
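
Two of the examples above are sketched below in Python to show the boolean structure of such conditions. This is not the platform's Expression Language. All function and parameter names are hypothetical, and reading "for 80% of pods" as "at least 80%" is an assumption.

```python
# Hypothetical sketch: share of entities in a group that match a predicate.
def fraction_matching(values, predicate) -> float:
    items = list(values)
    if not items:
        return 0.0
    return sum(1 for v in items if predicate(v)) / len(items)


# Readiness probe status = 0 for (at least) 80% of pods in a workload.
def readiness_condition(readiness_by_pod: dict, min_fraction: float = 0.8) -> bool:
    return fraction_matching(readiness_by_pod.values(), lambda s: s == 0) >= min_fraction


# (ART > baseline OR EPM > baseline) AND (CPM > threshold).
def combined_condition(art, art_baseline, epm, epm_baseline, cpm, cpm_threshold) -> bool:
    return (art > art_baseline or epm > epm_baseline) and cpm > cpm_threshold


pods = {"pod-a": 0, "pod-b": 0, "pod-c": 0, "pod-d": 0, "pod-e": 0}
print(readiness_condition(pods))
print(combined_condition(art=200, art_baseline=150, epm=1, epm_baseline=5,
                         cpm=500, cpm_threshold=100))
```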

For more information about the evaluation conditions, see Expression Language.

Critical and Warning Conditions

Conditions are classified as either critical or warning.

Critical conditions are evaluated before warning conditions. If you have defined a critical condition and a warning condition in the same health rule, the warning condition is evaluated only if the critical condition is not true.
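
The evaluation order can be sketched as a short function. The name `evaluate_status` and the use of Python callables for conditions are assumptions made for illustration.

```python
# Hypothetical sketch: the critical condition is checked first, and the
# warning condition is checked only when the critical condition is not true.
def evaluate_status(metric_value, critical_condition, warning_condition) -> str:
    if critical_condition(metric_value):
        return "critical"
    if warning_condition(metric_value):
        return "warning"
    return "normal"


def is_critical(value):
    return value > 90


def is_warning(value):
    return value > 80


# A value of 95 satisfies both conditions, but only critical is reported.
print(evaluate_status(95, is_critical, is_warning))
```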

Health Rule Evaluation Granularity

A health rule monitors the metrics of an entity at a granularity measured in minutes. If a metric value deviates according to the health rule condition, an alert is triggered for the health rule violation after a delay of a certain number of minutes.

The following table lists the default granularity and delay values based on the namespaces:

Namespace	Granularity	Delay
Application Performance Monitoring (APM)	1 minute	4 minutes
Amazon Web Services	5 minutes	3 minutes
Google Cloud Platform	5 minutes	3 minutes
Microsoft Azure	5 minutes	10 minutes
All other namespaces	1 minute	3 minutes

For example, for an entity with 1-minute granularity and a 3-minute delay, metric data for 11:00 AM is accepted until 11:03:59 AM for evaluation by the health rule. Any data for 11:00 AM that is reported on or after 11:04 AM is dropped by the health rule. Similarly, metric data for 11:01 AM is accepted until 11:04:59 AM, and so on.
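
The acceptance window in the 11:00 AM example (1-minute granularity, 3-minute delay) can be sketched as follows. The cutoff arithmetic is inferred from that single example; treating the window as granularity plus delay for other granularities is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


# Hypothetical sketch of the late-data cutoff.
# Assumption: data for a time slot is accepted while it arrives strictly
# before slot_start + granularity + delay.
def is_accepted(slot_start: datetime, arrival: datetime,
                granularity_minutes: int, delay_minutes: int) -> bool:
    cutoff = slot_start + timedelta(minutes=granularity_minutes + delay_minutes)
    return arrival < cutoff


slot = datetime(2024, 1, 1, 11, 0)
print(is_accepted(slot, datetime(2024, 1, 1, 11, 3, 59), 1, 3))  # inside the window
print(is_accepted(slot, datetime(2024, 1, 1, 11, 4, 0), 1, 3))   # dropped
```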

You can modify the granularity and delay values at the namespace level and the entity type level. This configuration is available in the healthrule:healthRuleScopeOverrides schema. However, the granularity and delay values must conform to one of the following enum values:

Enum Values	Description
`ONE_THREE`	1 minute granularity and 3 minutes delay
`ONE_FOUR`	1 minute granularity and 4 minutes delay
`FIVE_THREE`	5 minutes granularity and 3 minutes delay
`FIVE_TEN`	5 minutes granularity and 10 minutes delay

For example, you can modify the granularity and delay values of an APM entity from the default value ONE_FOUR (1 minute granularity and 4 minutes delay) to any one of the preceding supported values. Values other than the supported values are not allowed.

Points to consider:

The configuration specified at entity type level takes precedence and overrides the configuration specified at the namespace level.
If you don't specify the granularity and delay at the entity type level, then granularity and delay configured at the namespace level apply to all the entity types of the namespace.
If you don't specify the granularity and delay at either the namespace level or entity type level, the default configuration is considered.
If metric values of a namespace arrive later than the default delay, or later than the delay you specify in the template, they are ignored during the health rule evaluation.
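
The precedence described in the points above can be sketched as a lookup. The enum members mirror the supported values in the table; the function `resolve_granularity_delay` and its parameters are hypothetical and do not represent the structure of the healthrule:healthRuleScopeOverrides schema.

```python
from enum import Enum
from typing import Optional


class GranularityDelay(Enum):
    """Supported (granularity, delay) pairs, in minutes."""
    ONE_THREE = (1, 3)
    ONE_FOUR = (1, 4)
    FIVE_THREE = (5, 3)
    FIVE_TEN = (5, 10)


# Hypothetical sketch: entity type override, then namespace override,
# then the namespace default.
def resolve_granularity_delay(
    entity_type_override: Optional[GranularityDelay],
    namespace_override: Optional[GranularityDelay],
    namespace_default: GranularityDelay,
) -> GranularityDelay:
    if entity_type_override is not None:
        return entity_type_override
    if namespace_override is not None:
        return namespace_override
    return namespace_default


# No entity type override, so the namespace-level override applies.
print(resolve_granularity_delay(None, GranularityDelay.FIVE_THREE,
                                GranularityDelay.ONE_FOUR))
```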

Persistence Thresholds

Temporary spikes in metric performance data are a major cause of false alerts. Persistence thresholds allow you to define a sensitivity level for a health rule and thereby reduce the number of false alerts. You can define the number of times metric performance data should exceed the defined threshold during the evaluation time frame to constitute a violation and subsequently trigger an alert.
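
A persistence threshold can be pictured as a count over the evaluation time frame. In this sketch the function name and parameters are hypothetical, and requiring "at least" the configured number of breaches is an assumption.

```python
# Hypothetical sketch: a violation is raised only when the metric exceeds
# the threshold at least min_violations times within the evaluation window.
def persistent_violation(samples, threshold: float, min_violations: int) -> bool:
    breaches = sum(1 for value in samples if value > threshold)
    return breaches >= min_violations


# One spike in five samples does not trigger an alert when three are required.
print(persistent_violation([70, 95, 72, 71, 69], threshold=80, min_violations=3))
# Three breaches in five samples do.
print(persistent_violation([85, 90, 70, 88, 60], threshold=80, min_violations=3))
```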

For more information about the health rule concepts, see the Cisco Cloud Observability documentation on Entity Health Monitoring.