Cisco Observability Platform

This documentation and the Cisco Observability Platform functionalities it describes are subject to change. Data saved on the platform may disappear and APIs may change without notice.

Health Rules Schema and Expression Language

To create a health rule for your domain entities, you need to understand the schema of a health rule. Any health rule that you create must follow the specified schema.

The Health Rule platform component provides you with two types of schemas:

healthrule:healthRuleTemplate: Use this schema to create a health rule object file.
healthrule:healthRuleScopeOverrides: Use this schema to create a health rule override object file where you can specify the entity types that you want to override and customize the rollup paths that you want to monitor.

Expression Language

The health rule expression language helps you to define:

the entities to be evaluated.
the conditions to evaluate the entities.
the roll-up relationship for the entities, which controls the health of the parent entities based on the child entities.

Note: The health rule expression language described in this document is version 2 (v2). Version 1 (v1) is no longer supported.

Evaluation Expression

In the topologyExpression field of the health rule template, you can define conditions to filter the specific entities that a health rule evaluates. The conditions are a subset of the Unified Query Language (UQL). For information about the UQL, see Unified Query Language User Guide.

The following table illustrates a few examples:

Evaluation Conditions	Evaluation Expression
Evaluate all pods in the k8s namespace.	`entities(k8s:pod)`
Evaluate all pods whose k8s.pod.status attribute is Running.	`entities(k8s:pod)[attributes("k8s.pod.status")='Running']`
Evaluate all pods whose namespace tag is `appdynamics`.	`entities(k8s:pod)[tags("namespace")='appdynamics']`
Evaluate all pods associated with the k8s deployment type. Note that the `out.to` operator is used to denote association among the entities. Currently, the health rule expression language supports only the following associations that are defined from parent to child entities: `ONE_TO_ONE`, `ONE_TO_MANY`, and all hierarchical associations.	`entities(k8s:deployment).out.to(k8s:pod)`
Evaluate all pods with the attribute k8s.pod.status = Running that are associated with the k8s deployment type.	`entities(k8s:deployment).out.to(k8s:pod)[attributes("k8s.pod.status")='Running']`
Evaluate all pods with the attribute k8s.pod.status = Running that are associated with k8s deployments whose namespace name is appdynamics.	`entities(k8s:deployment)[attributes("k8s.namespace.name")='appdynamics'].out.to(k8s:pod)[attributes("k8s.pod.status")='Running']`
Evaluate all pods with the attribute k8s.pod.status = Running and the cluster name appd-cluster that are associated with k8s deployments whose namespace name is appdynamics.	`entities(k8s:deployment)[attributes("k8s.namespace.name")='appdynamics'].out.to(k8s:pod)[attributes("k8s.pod.status")='Running' && attributes("k8s.cluster.name")='appd-cluster']`
Evaluate pods that belong to the abstract type workload, where the workload is associated with a k8s namespace and the namespace is associated with the cluster whose name attribute is appd-cluster.	`entities(k8s:cluster)[attributes("k8s.cluster.name")='appd-cluster'].out.to(k8s:namespace).out.to(k8s:workload).out.to(k8s:pod)`

The following sample shows how to define an evaluation expression in your health rule object file:

json

"evaluationObjects":  {
  "topologyExpression": "entities('spacefleet:starship').out.to('spacefleet:torpedo_tube')",
  "evaluationEntityType": "spacefleet:torpedo_tube"
}

Dynamic Grouping in Evaluation Expression

In the evaluation condition, you can add filters to monitor the entity types based on the filter criteria. For example, if you want to monitor a specific torpedo that has the name Black Shark, use the following condition in the topologyExpression field:

json

"evaluationObjects":  {
    "topologyExpression": "entities('spacefleet:starship').out.to('spacefleet:torpedo_tube')[attributes(\"name\")='Black Shark']",
    "evaluationEntityType": "spacefleet:starship"
}

Note: Dynamic grouping is supported only for metrics expressions.
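Because the topologyExpression value is itself a JSON string, any double quotes inside the expression need to be escaped in the object file. One way to avoid escaping by hand is to generate the fragment with a JSON serializer. The following Python sketch is illustrative only: the helper name `evaluation_objects` is hypothetical and not part of the platform, and the field names and expression are simply reused from the dynamic grouping sample.

```python
import json


def evaluation_objects(topology_expression: str, entity_type: str) -> dict:
    """Build the evaluationObjects fragment of a health rule object file.

    Hypothetical helper for illustration; only the two field names come
    from the samples in this document.
    """
    return {
        "evaluationObjects": {
            "topologyExpression": topology_expression,
            "evaluationEntityType": entity_type,
        }
    }


# Expression from the dynamic grouping sample, written as a Python string.
expression = (
    "entities('spacefleet:starship')"
    ".out.to('spacefleet:torpedo_tube')"
    "[attributes(\"name\")='Black Shark']"
)

fragment = evaluation_objects(expression, "spacefleet:starship")

# json.dumps escapes the inner double quotes as \" so the output is valid JSON.
serialized = json.dumps(fragment, indent=2)
print(serialized)
```

Round-tripping the output through `json.loads` is a quick way to confirm that the expression survives serialization unchanged.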

Rollup Expression

A child entity can determine the health of its parent entity through a rollup relationship: the health of the parent is rolled up from the health of its children. The path to the child entity is defined in the evaluation expression.

For example, in the following topologyExpression:

entities(k8s:cluster)[attributes("k8s.cluster.name")='appd-cluster'].out.to(k8s:namespace).out.to(k8s:workload).out.to(k8s:pod)

The relationship among the entities can be defined as:

k8s:cluster is the parent entity of k8s:namespace
k8s:namespace is the parent entity of k8s:workload
k8s:workload is the parent entity of k8s:pod

Alternatively, you can read the entity relationship as:

k8s:pod is the child entity of k8s:workload
k8s:workload is the child entity of k8s:namespace
k8s:namespace is the child entity of k8s:cluster

To define the health rollup relationship among the entities, you need to use the following format in your health rule object file:

json

{
   "rollupHealthConfig":{
      "rollupTo":"k8s:workload",
      "criteria":{
         "threshold":50,
         "thresholdType":"PERCENTAGE"
      },
      "rollupHealthConfig":{
         "rollupTo":"k8s:namespace",
         "criteria":{
            "threshold":40,
            "thresholdType":"PERCENTAGE"
         },
         "rollupHealthConfig":{
            "rollupTo":"k8s:cluster",
            "criteria":{
               "threshold":2,
               "thresholdType":"COUNT"
            }
         }
      }
   }
}

The preceding sample indicates the following health rollup relationship:

If 50% of the k8s:pod entities in a k8s:workload are unhealthy, then the k8s:workload is unhealthy.
If 40% of the k8s:workload entities in a k8s:namespace are unhealthy, then the k8s:namespace is unhealthy.
If two k8s:namespace entities in a k8s:cluster are unhealthy, then the k8s:cluster is unhealthy.
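The rollup logic can be sketched in a few lines of Python. This is not the platform's implementation: the function names are hypothetical, the child health values are invented, and the sketch assumes a threshold is met when the unhealthy count or percentage is greater than or equal to the configured value, which this document does not state explicitly.

```python
from typing import Dict, List


def threshold_met(unhealthy: int, total: int, criteria: Dict) -> bool:
    """Return True when the unhealthy children satisfy the rollup criteria.

    Assumption: the threshold is met at "greater than or equal to".
    """
    if total == 0:
        return False
    if criteria["thresholdType"] == "PERCENTAGE":
        return 100.0 * unhealthy / total >= criteria["threshold"]
    if criteria["thresholdType"] == "COUNT":
        return unhealthy >= criteria["threshold"]
    raise ValueError(f"unknown thresholdType: {criteria['thresholdType']}")


def parent_unhealthy(children_unhealthy: List[bool], criteria: Dict) -> bool:
    """Roll the health of child entities up to their parent."""
    return threshold_met(sum(children_unhealthy), len(children_unhealthy), criteria)


# Criteria copied from the sample rollupHealthConfig.
workload_criteria = {"threshold": 50, "thresholdType": "PERCENTAGE"}
namespace_criteria = {"threshold": 40, "thresholdType": "PERCENTAGE"}
cluster_criteria = {"threshold": 2, "thresholdType": "COUNT"}

# A workload with 2 of 4 pods unhealthy (50%) becomes unhealthy.
workload_a = parent_unhealthy([True, True, False, False], workload_criteria)
# A workload with 1 of 4 pods unhealthy (25%) stays healthy.
workload_b = parent_unhealthy([True, False, False, False], workload_criteria)
# A namespace with 1 of 2 workloads unhealthy (50% >= 40%) becomes unhealthy.
namespace_1 = parent_unhealthy([workload_a, workload_b], namespace_criteria)
# A cluster with one unhealthy namespace stays below the COUNT threshold of 2.
cluster = parent_unhealthy([namespace_1, False], cluster_criteria)

print(workload_a, workload_b, namespace_1, cluster)  # True False True False
```

Each level only looks at the health of its direct children, which mirrors the nesting of rollupHealthConfig in the sample.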

Boolean Condition Expression

A condition consists of one or more statements that evaluate different metrics, events, and logs. When you define multiple conditions, you can combine them into a single evaluation criterion by using a boolean expression.

The advantages of using a boolean expression are:

It eliminates the need to create multiple health rules to monitor various performance metrics. A boolean expression lets you evaluate complex criteria for multiple conditions at once.
A well-calibrated boolean expression reduces false alerts.
It makes health rules with complex evaluation criteria easy to create and maintain by using simple condition names. Conditions are labeled A, B, C, and so on.
It allows the use of the `and` and `or` operators to define a highly complex boolean expression.

The following sample illustrates how to use a boolean expression in the criteriaExpression field:

Boolean "(A or B)":

json

"criticalCriteria": {
    "conditions": [
      {
        "name": "matter level",
        "conditionExpression": "metrics('${manifest.name & manifest.tag}:matter.utilization', 'space')[timestamp > now - 1m].value() < 10",
        "label": "A",
        "evaluateToTrueOnNoData": false
      },
      {
        "name": "anti matter level",
        "conditionExpression": "metrics('${manifest.name & manifest.tag}:anti_matter.utilization', 'space')[timestamp > now - 1m].value() < 10",
        "label": "B",
        "evaluateToTrueOnNoData": false
      }
    ],
    "criteriaExpression": "(A or B)"
  }

Boolean with condition:

json

{
    "name": "Events-pod-filter",
    "description": "",
    "enabled": true,
    "scheduleName": "Always",
    "waitTimeAfterViolation": "5m",
    "evaluationObjects": {
        "topologyExpression": "entities(k8s:pod)[attributes(k8s.cluster.id)='events-UUID-alerting-test']",
        "evaluationEntityType": "k8s:pod"
    },
    "criticalCriteria": {
        "criteriaExpression": "A",
        "conditions": [
            {
                "name": "Condition 1",
                "label": "A",
                "conditionExpression": "(events(\"k8s:native_event\")[source = 'infra-agent'])[timestamp > (now - 30m)] < 1000",
                "evaluateToTrueOnNoData": false
            }
        ]
    },
    "entityType": "k8s:pod",
    "createdAt": "2023-10-17T07:03:54Z",
    "updatedAt": "2023-10-17T08:19:36Z",
    "id": "4bef91a3-669b-4859-b4d5-37e3630ea455"
}

Leaf Condition

The leaf condition of a health rule configuration represents one of the conditions that must be evaluated. You can create leaf conditions for metrics, events, and logs.

The output of a leaf condition is:

0 if the condition evaluates to true
1 if the condition evaluates to false

If the required data is not available, then the unknown result is represented by -1.

Example:

Consider the following leaf conditions for metrics and their specified labels:

Condition	Label
`metrics("apm:calls_per_min", "apm")[timestamp > (now - 30m)].value() > 10`	A
`metrics("apm:memory_limit", "apm")[timestamp > (now - 30m)].value() > 25`	B
`metrics("apm:memory_usage", "apm")[timestamp > (now - 30m)].value() > 15`	C

You can form a boolean expression by combining conditions such as (A and (B or C)).
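To illustrate how labeled leaf results combine, the following Python sketch evaluates a criteria expression such as (A and (B or C)) against a mapping of labels to boolean results. It is a hypothetical stand-alone evaluator, not the platform's parser; it assumes that `and` binds more tightly than `or`, and it ignores the unknown (no data) result.

```python
import re
from typing import Dict, List


def tokenize(expression: str) -> List[str]:
    """Split an expression into parentheses, operators, and labels.

    Whitespace and any other characters are skipped.
    """
    return re.findall(r"\(|\)|[A-Za-z_][A-Za-z0-9_]*", expression)


def evaluate(expression: str, results: Dict[str, bool]) -> bool:
    """Evaluate a criteria expression against leaf condition results."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos].lower() == "or":
            pos += 1
            right = parse_and()  # always parse the right side before combining
            value = value or right
        return value

    def parse_and() -> bool:
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos].lower() == "and":
            pos += 1
            right = parse_atom()
            value = value and right
        return value

    def parse_atom() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in results:
            return results[token]
        raise ValueError(f"unknown label: {token}")

    value = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return value


leaf_results = {"A": True, "B": False, "C": True}
print(evaluate("(A and (B or C))", leaf_results))  # True
```

With the same labels, setting C to false makes the whole expression false, because neither B nor C then satisfies the inner `or`.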

Points to Consider:

You can form a boolean expression by combining multiple leaf conditions of the same type such as multiple metric-based conditions, event-based conditions, or log-based conditions. However, a boolean expression formed by combining an event-based condition, a metric-based condition, and a log-based condition is not supported.
You can't configure an event-based condition to trigger when a violation occurs x times in the last y minutes.

Supported Functions

You can use the following functions in your expressions:

Function	Description
`count()`	Returns the number of measurements of a metric for a time period.
`max()`	Returns the maximum value of a metric for a time period.
`min()`	Returns the minimum value of a metric for a time period.
`stdDev()`	Returns the deviation of the metric data from the mean or average value.
`sum()`	Returns the sum of the metric values for a time period.
`value()`	Returns a reference to the underlying function of a metric category.
`baseline()`	Returns the value for the baseline configuration.
`stdDevRange()`	Returns a range of values for the baseline mean and standard deviation values. It takes the argument as an integer (standard deviation point) and applies it to the baseline function result.
`percentageRange()`	Returns a range of values for the baseline mean and standard deviation values. It takes the argument as an integer (percentage) and applies it to the baseline function result.
`percentile()`	Returns the specified percentile of the metric data. Currently, the following percentile values (integer) are supported: 50, 75, 90, 95, and 99.
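To make the aggregate functions concrete, the following Python sketch applies comparable calculations to a list of data points in one time window. It is not the platform's implementation: the sample values are invented, and two details are assumptions, namely that stdDev() is a population standard deviation and that percentile() uses the nearest-rank method.

```python
import math
import statistics
from typing import List

# Hypothetical data points reported for one metric within the time window.
window: List[float] = [4.0, 12.0, 9.0, 15.0, 11.0, 20.0, 7.0, 13.0]


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


summary = {
    "count": len(window),
    "max": max(window),
    "min": min(window),
    "sum": sum(window),
    "stdDev": statistics.pstdev(window),
    "percentile(50)": percentile(window, 50),
}

# Counterpart of a filtered count such as [value > 10].count():
# how many data points in the window exceed 10.
count_above_10 = sum(1 for v in window if v > 10)

for name, value in summary.items():
    print(f"{name}: {value}")
print(f"count of values > 10: {count_above_10}")
```

In a condition, the result of such a function is what gets compared with the threshold on the right-hand side of the operator.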

Supported Operators

You can use the following operators in your expressions:

Operator	Description
`and`	Returns a true value if all conditions in an expression are true. If any one condition is false, the operator returns a false value.
`or`	Returns a true value if any one condition in an expression is true.
`between`	Checks if the left hand side value is within the range of the right hand side value in an expression. For example, `3 between [2,5]` returns true.
`notBetween`	Checks if the left hand side value is not within the range of the right hand side value in an expression. For example, `1 notBetween [2,5]` returns true.
`<`	Checks if the left hand side value is smaller than the right hand side value in an expression.
`>`	Checks if the left hand side value is larger than the right hand side value in an expression.
`=`	Checks if the left hand side value is equal to the right hand side value in an expression.
`!=`	Checks if the left hand side value is not equal to the right hand side value in an expression.
`<=`	Checks if the left hand side value is less than or equal to the right hand side value in an expression.
`>=`	Checks if the left hand side value is larger than or equal to the right hand side value in an expression.
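The range operators pair naturally with the baseline range functions. The Python sketch below models `between` and `notBetween` over a two-element range and derives ranges from a baseline mean in the way the function descriptions suggest. Treat it as an approximation: inclusive bounds, the "mean plus or minus n standard deviations" formula, the "mean plus or minus p percent" formula, and the baseline numbers are all assumptions rather than documented platform behavior.

```python
from typing import Tuple

Range = Tuple[float, float]


def between(value: float, bounds: Range) -> bool:
    """True when value lies inside bounds (bounds assumed inclusive)."""
    low, high = bounds
    return low <= value <= high


def not_between(value: float, bounds: Range) -> bool:
    """Negation of between, mirroring the notBetween operator."""
    return not between(value, bounds)


def std_dev_range(mean: float, std_dev: float, n: int) -> Range:
    """Assumed shape of stdDevRange(n): mean plus or minus n standard deviations."""
    return (mean - n * std_dev, mean + n * std_dev)


def percentage_range(mean: float, percent: int) -> Range:
    """Assumed shape of percentageRange(p): mean plus or minus p percent."""
    delta = abs(mean) * percent / 100
    return (mean - delta, mean + delta)


# The two examples from the operator table.
print(between(3, (2, 5)))      # True
print(not_between(1, (2, 5)))  # True

# Hypothetical baseline: mean of 100 calls per minute, standard deviation of 10.
band = std_dev_range(100.0, 10.0, 2)
print(band)                         # (80.0, 120.0)
print(between(95.0, band))          # True
print(not_between(130.0, band))     # True
print(percentage_range(100.0, 30))  # (70.0, 130.0)
```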

Sample Condition Expressions

The following table illustrates a few sample condition expressions:

Condition	Expression
The maximum value for the calls per minute (calls_per_minute) metric from the APM source in the last 30 minutes is greater than 10.	`metrics("apm:calls_per_min", "apm")[timestamp > (now - 30m)].max() > 10`
The value of the calls_per_minute metric from the APM source is greater than 10 more than 16 times in the last 30 minutes.	`metrics("apm:calls_per_min", "apm")[timestamp > (now - 30m) && value > 10].count() > 16`
The ratio of the memory usage (memory_usage) metric to the memory limit (memory_limit) metric from the APM source in the last 30 minutes is greater than 5.	`(metrics("apm:memory_usage", "apm") / metrics("apm:memory_limit", "apm"))[timestamp > (now - 30m)].value() > 5`
The value for the calls_per_minute metric from the APM source in the last 30 minutes is greater than 2 standard deviations from the baseline value for baseline configuration Daily Trend - Last 30 days. In this case, the metric expression is compared with the baseline value. For information about baselines, see Cisco Cloud Observability documentation.	`metrics("apm:calls_min", "apm")[timestamp > (now - 30m)].value() > metrics("apm:calls_min", "apm").baseline("Daily Trend - Last 30 days").stdDev(2)`
The value for the calls_per_minute metric from the APM source in the last 30 minutes is greater than 30 percent from the baseline value for the baseline configuration Daily Trend - Last 30 days. In this case, the metric expression is compared with the baseline value.	`metrics("apm:calls_min", "apm")[timestamp > (now - 30m)].value() > metrics("apm:calls_min", "apm").baseline("Daily Trend - Last 30 days").percentageRange(30)`
The value for the calls_per_minute metric from the APM source in the last 30 minutes is within the range of 2 standard deviations from the baseline value for baseline configuration Daily Trend - Last 30 days. In this case, the metric expression is compared with the baseline value.	`metrics("apm:calls_min", "apm")[timestamp > (now - 30m)].value() between metrics("apm:calls_min", "apm").baseline("Daily Trend - Last 30 days").stdDevRange(2)`
The value for the calls_per_minute metric from the APM source in the last 30 minutes is not within the range of 2 standard deviations from the baseline value for baseline configuration Daily Trend - Last 30 days. In this case, the metric expression is compared with the baseline value.	`metrics("apm:calls_min", "apm")[timestamp > (now - 30m)].value() notBetween metrics("apm:calls_min", "apm").baseline("Daily Trend - Last 30 days").stdDevRange(2)`
The 50th percentile value of the calls_per_minute metric from the APM source is greater than 10 more than 15 times in the last 30 minutes.	`metrics("apm:calls_min", "apm")[timestamp > (now - 30m) && percentile(50) > 10].count() > 15`
The count of events of the type k8s:native_event from the source infra-agent with severity "Severe" and description "Pod Restarted" is more than 100 in the last 30 minutes.	`events("k8s:native_event")[source = 'infra-agent' && attributes(severity) = 'SEVERE' && attributes(description) = 'Pod Restarted'][timestamp > (now - 30m)] > 100`