Health Rules Schema and Expression Language
To create a health rule for your domain entities, you need to understand the schema of a health rule. Any health rule that you create must follow the specified schema.
The Health Rule platform component provides you with two types of schemas:
Expression Language
The health rule expression language helps you to define:
- the entities to be evaluated.
- the conditions to evaluate the entities.
- the roll-up relationship for the entities, which controls the health of the parent entities based on the child entities.
Note: The health rule expression language described in this document is version 2 (v2). Version 1 (v1) is no longer supported.
Evaluation Expression
In the topologyExpression field of the health rule template, you can define conditions to filter the specific entities that a health rule evaluates. The conditions are a subset of the Unified Query Language (UQL). For information about UQL, see the Unified Query Language User Guide.
The following table illustrates a few examples:
Evaluation Conditions | Evaluation Expression |
Evaluate all pods in the k8s namespace. | entities(k8s:pod) |
Evaluate all pods whose attribute k8s.pod.status is Running. | entities(k8s:pod)[attributes("k8s.pod.status")='Running'] |
Evaluate all pods whose tag namespace is appdynamics. | entities(k8s:pod)[tags("namespace")='appdynamics'] |
Evaluate all pods associated with the k8s deployment type. Note that the out.to operator denotes an association among the entities. Currently, the health rule expression language supports only the following associations that are defined from parent to child entities: ONE_TO_ONE, ONE_TO_MANY, and all hierarchical associations. | entities(k8s:deployment).out.to(k8s:pod) |
Evaluate all pods whose attribute k8s.pod.status is Running and that are associated with the k8s deployment type. | entities(k8s:deployment).out.to(k8s:pod)[attributes("k8s.pod.status")='Running'] |
Evaluate all pods whose attribute k8s.pod.status is Running and that are associated with the k8s deployment type with the namespace name appdynamics. | entities(k8s:deployment)[attributes("k8s.namespace.name")='appdynamics'].out.to(k8s:pod)[attributes("k8s.pod.status")='Running'] |
Evaluate all pods whose attribute k8s.pod.status is Running and that are associated with the k8s deployment type with the namespace name appdynamics and the cluster name appd-cluster. | entities(k8s:deployment)[attributes("k8s.namespace.name")='appdynamics'].out.to(k8s:pod)[attributes("k8s.pod.status")='Running' && attributes("k8s.cluster.name")='appd-cluster'] |
Evaluate pods of the abstract type workload, where the workload is associated with a k8s namespace and the namespace is associated with the cluster whose name attribute is appd-cluster. | entities(k8s:cluster)[attributes("k8s.cluster.name")='appd-cluster'].out.to(k8s:namespace).out.to(k8s:workload).out.to(k8s:pod) |
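The expressions in the table follow a regular pattern: an entities(...) root, optional attribute filters in square brackets, and out.to(...) hops from parent to child. As a rough illustration only, the following Python sketch assembles such strings. The helper names step and topology_expression are hypothetical and are not part of the Health Rule platform; the sketch covers attribute filters only, not tags.

```python
from typing import Mapping, Optional, Sequence


def step(entity_type: str, attributes: Optional[Mapping[str, str]] = None) -> str:
    """Render one entity type with an optional attribute filter (hypothetical helper)."""
    if not attributes:
        return f"({entity_type})"
    conditions = " && ".join(
        f"""attributes("{name}")='{value}'""" for name, value in attributes.items()
    )
    return f"({entity_type})[{conditions}]"


def topology_expression(path: Sequence[tuple]) -> str:
    """Join a parent-to-child path into one expression string.

    Each path element is a tuple: (entity_type,) or (entity_type, attribute_filters).
    """
    if not path:
        raise ValueError("path must contain at least one entity type")
    expression = "entities" + step(*path[0])
    for hop in path[1:]:
        # out.to denotes the association from a parent entity to its child.
        expression += ".out.to" + step(*hop)
    return expression


running_pods = topology_expression([
    ("k8s:deployment", {"k8s.namespace.name": "appdynamics"}),
    ("k8s:pod", {"k8s.pod.status": "Running"}),
])
print(running_pods)
```

The printed string matches the table row for running pods of deployments in the appdynamics namespace.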
The following sample shows how to define an evaluation expression in your health rule object file:
Dynamic Grouping in Evaluation Expression
In the evaluation condition, you can add filters to monitor the entity types based on the filter criteria. For example, if you want to monitor a specific torpedo that has the name Black Shark, use the following condition in the topologyExpression field:
Note: Dynamic grouping is supported only for metrics expressions.
Rollup Expression
A rollup relationship lets the health of child entities determine the health of their parent entity. The path to the child entity is defined in the evaluation expression.
For example, in the following topologyExpression:
entities(k8s:cluster)[attributes("k8s.cluster.name")='appd-cluster'].out.to(k8s:namespace).out.to(k8s:workload).out.to(k8s:pod)
The relationship among the entities can be defined as:
- k8s:cluster is the parent entity of k8s:namespace
- k8s:namespace is the parent entity of k8s:workload
- k8s:workload is the parent entity of k8s:pod
Alternatively, you can read the entity relationship as:
- k8s:pod is the child entity of k8s:workload
- k8s:workload is the child entity of k8s:namespace
- k8s:namespace is the child entity of k8s:cluster
To define the health rollup relationship among the entities, you need to use the following format in your health rule object file:
The preceding sample indicates the following health rollup relationship:
- If 50% of the k8s:pod entities are unhealthy, then k8s:workload is unhealthy.
- If 40% of the k8s:workload entities are unhealthy, then k8s:namespace is unhealthy.
- If two k8s:namespace entities are unhealthy, then k8s:cluster is unhealthy.
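The rollup rules above mix two kinds of thresholds: a percentage of unhealthy children (50%, 40%) and an absolute count (two). The following Python sketch shows one way to reason about them. The names RollupThreshold and parent_is_unhealthy are hypothetical, and treating the threshold as inclusive ("at least") is an assumption; this document does not state how the platform handles the boundary.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RollupThreshold:
    """Hypothetical threshold: a percentage of children or an absolute count."""
    value: float
    is_percentage: bool


def parent_is_unhealthy(children_unhealthy: Sequence[bool],
                        threshold: RollupThreshold) -> bool:
    """Return True when enough children are unhealthy to mark the parent unhealthy.

    Assumption: the threshold is inclusive (>=).
    """
    unhealthy = sum(1 for flag in children_unhealthy if flag)
    if threshold.is_percentage:
        if not children_unhealthy:
            return False  # no children, nothing to roll up
        return 100.0 * unhealthy / len(children_unhealthy) >= threshold.value
    return unhealthy >= threshold.value


# Two of four pods are unhealthy: 50% reaches the pod-to-workload threshold.
workload_unhealthy = parent_is_unhealthy(
    [True, True, False, False], RollupThreshold(50, is_percentage=True))

# One of three workloads is unhealthy: about 33% stays below the 40% threshold.
namespace_unhealthy = parent_is_unhealthy(
    [workload_unhealthy, False, False], RollupThreshold(40, is_percentage=True))

# The cluster needs two unhealthy namespaces (absolute count).
cluster_unhealthy = parent_is_unhealthy(
    [namespace_unhealthy, True], RollupThreshold(2, is_percentage=False))
```

In this run the workload is unhealthy, but the namespace and therefore the cluster are not, because only one namespace ends up unhealthy.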
Boolean Condition Expression
A condition consists of a single or multiple statements that evaluate different metrics, events, and logs. When you define multiple conditions, you may want to define evaluation criteria using a boolean expression.
The advantages of using a boolean expression are:
- Eliminates the need to create multiple health rules to monitor various performance metrics. A boolean expression lets you evaluate complex criteria for multiple conditions in a single health rule.
- Reduces false alerts when the boolean expression is well calibrated.
- Makes health rules with complex evaluation criteria easy to create and maintain, because conditions are referenced by simple names such as A, B, and C.
- Allows the and and or operators to be combined into highly complex boolean expressions.
The following sample illustrates how to use a boolean expression in the criteriaExpression field:
Boolean "(A or B)":
Boolean with condition:
Leaf Condition
The leaf condition of a health rule configuration represents one of the conditions that must be evaluated. You can create leaf conditions for metrics, events, and logs.
The output of a leaf condition is:
If the required data is not available, then the unknown result is represented by -1.
Example:
Consider the following leaf conditions for metrics and their specified labels:
Condition | Label |
metrics("apm:calls_per_min", "apm")[timestamp > (now - 30m)].value() > 10 | A |
metrics("apm:memory_limit", "apm")[timestamp > (now - 30m)].value() > 25 | B |
metrics("apm:memory_usage", "apm")[timestamp > (now - 30m)].value() > 15 | C |
You can combine these conditions into a boolean expression such as (A and (B or C)).
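To make the combination concrete, the following Python sketch evaluates an expression such as (A and (B or C)) once each labelled leaf condition has produced a true or false result. It is only an illustration: the function names are hypothetical, the rule that and binds tighter than or is an assumption (parentheses avoid the question), and the unknown result (-1) is not modelled.

```python
import re
from typing import Dict, List

# Tokens: parentheses, the and/or operators, and single-letter condition labels.
_TOKEN = re.compile(r"\s*(\(|\)|and\b|or\b|[A-Z])")


def tokenize(expression: str) -> List[str]:
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = _TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expression: str, results: Dict[str, bool]) -> bool:
    """Evaluate a boolean expression over leaf-condition results."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            rhs = parse_and()  # always parse the right side before combining
            value = value or rhs
        return value

    def parse_and() -> bool:
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in results:
            return results[token]
        raise ValueError(f"unknown token or label: {token}")

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# A and C hold, B does not: (A and (B or C)) is satisfied.
violated = evaluate("(A and (B or C))", {"A": True, "B": False, "C": True})
```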
Points to Consider:
- You can form a boolean expression by combining multiple leaf conditions of the same type, such as multiple metric-based conditions, multiple event-based conditions, or multiple log-based conditions. A boolean expression that mixes event-based, metric-based, and log-based conditions is not supported.
- You can't configure an event-based condition to trigger when a violation occurs x times in the last y minutes.
Supported Functions
You can use the following functions in your expressions:
Function | Description |
count() | Returns the number of measurements of the metric data for a time period. |
max() | Returns the maximum value of the metric data for a time period. |
min() | Returns the minimum value of the metric data for a time period. |
stdDev() | Returns the deviation of the metric data from the mean or average value. |
sum() | Returns the sum of the metric values for a time period. |
value() | Returns a reference to the underlying function of a metric category. |
baseline() | Returns the value for the baseline configuration. |
stdDevRange() | Returns a range of values for the baseline mean and standard deviation values. It takes an integer argument (standard deviation point) and applies it to the baseline function result. |
percentageRange() | Returns a range of values for the baseline mean and standard deviation values. It takes an integer argument (percentage) and applies it to the baseline function result. |
percentile() | Returns the specified percentile of the metric data. Currently, the following percentile values (integer) are supported: 50, 75, 90, 95, and 99. |
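For intuition, the aggregation functions in the table can be approximated over a plain list of metric samples. The Python sketch below is not the platform's implementation. It assumes a population standard deviation for stdDev() and a nearest-rank method for percentile(), neither of which this document specifies.

```python
import math
from typing import Sequence

SUPPORTED_PERCENTILES = (50, 75, 90, 95, 99)


def count(values: Sequence[float]) -> int:
    return len(values)


def std_dev(values: Sequence[float]) -> float:
    """Population standard deviation (assumption; the platform may differ)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile (assumption) for the supported integer values."""
    if p not in SUPPORTED_PERCENTILES:
        raise ValueError(f"unsupported percentile: {p}")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[rank - 1]


# Hypothetical samples for one metric over one time period.
samples = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "count": count(samples),
    "max": max(samples),
    "min": min(samples),
    "sum": sum(samples),
    "stdDev": std_dev(samples),
    "p50": percentile(samples, 50),
    "p90": percentile(samples, 90),
}
```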
Supported Operators
You can use the following operators in your expressions:
Operator | Description |
and | Returns true if all conditions in an expression are true. If any one condition is false, the operator returns false. |
or | Returns true if any one condition in an expression is true. |
between | Checks if the left-hand side value is within the range given on the right-hand side of an expression. For example, 3 between [2,5] returns true. |
notBetween | Checks if the left-hand side value is not within the range given on the right-hand side of an expression. For example, 1 notBetween [2,5] returns true. |
< | Checks if the left-hand side value is less than the right-hand side value in an expression. |
> | Checks if the left-hand side value is greater than the right-hand side value in an expression. |
= | Checks if the left-hand side value is equal to the right-hand side value in an expression. |
!= | Checks if the left-hand side value is not equal to the right-hand side value in an expression. |
<= | Checks if the left-hand side value is less than or equal to the right-hand side value in an expression. |
>= | Checks if the left-hand side value is greater than or equal to the right-hand side value in an expression. |
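The range operators are easiest to read alongside a range-producing function such as stdDevRange(). The following Python sketch illustrates that pairing under two stated assumptions: the range bounds are inclusive, and stdDevRange(k) means the baseline mean plus or minus k baseline standard deviations. The function names are hypothetical.

```python
from typing import Tuple

Range = Tuple[float, float]


def between(value: float, bounds: Range) -> bool:
    """True when value lies within [low, high]; inclusive bounds are an assumption."""
    low, high = bounds
    return low <= value <= high


def not_between(value: float, bounds: Range) -> bool:
    return not between(value, bounds)


def std_dev_range(baseline_mean: float, baseline_std: float, k: int) -> Range:
    """Assumed meaning of stdDevRange(k): mean +/- k standard deviations."""
    return (baseline_mean - k * baseline_std, baseline_mean + k * baseline_std)


# Hypothetical baseline: mean 100 calls per minute, standard deviation 10.
normal_band = std_dev_range(100.0, 10.0, 2)

within_band = between(115.0, normal_band)       # inside the band
outside_band = not_between(130.0, normal_band)  # above the band
```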
Sample Condition Expressions
The following table illustrates a few sample condition expressions:
Condition | Expression |
The maximum value of the calls per minute (calls_per_minute) metric from the APM source in the last 30 minutes is greater than 10. | metrics("apm:calls_per_min", "apm")[timestamp > (now - 30m)].max() > 10 |
The value of the calls_per_minute metric from the APM source is greater than 10 more than 16 times in the last 30 minutes. | metrics("apm:calls_per_min", "apm")[timestamp > (now - 30m) && value > 10].count() > 16 |
The ratio of the memory usage (memory_usage) metric to the memory limit (memory_limit) metric from the APM source in the last 30 minutes is greater than 5. | (metrics("apm:memory_usage", "apm") / metrics("apm:memory_limit", "apm"))[timestamp > (now - 30m)].value() > 5 |
The value of the calls_per_minute metric from the APM source in the last 30 minutes is greater than 2 standard deviations from the baseline value for the baseline configuration Daily Trend - Last 30 days. In this case, the metric expression is compared with the baseline value. For information about baselines, see Cisco Cloud Observability documentation. | metrics("apm:calls_min", "apm")[timestamp > (now - 30m)].value() > metrics("apm:calls_min", "apm").baseline("Daily Trend - Last 30 days").stdDev(2) |
The value of the calls_per_minute metric from the APM source in the last 30 minutes is greater than 30 percent from the baseline value for the baseline configuration Daily Trend - Last 30 days. In this case, the metric expression is compared with the baseline value. | metrics("apm:calls_min", "apm")[timestamp > (now - 30m)].value() > metrics("apm:calls_min", "apm").baseline("Daily Trend - Last 30 days").percentageRange(30) |
The value of the calls_per_minute metric from the APM source in the last 30 minutes is within the range of 2 standard deviations from the baseline value for the baseline configuration Daily Trend - Last 30 days. In this case, the metric expression is compared with the baseline value. | metrics("apm:calls_min", "apm")[timestamp > (now - 30m)].value() between metrics("apm:calls_min", "apm").baseline("Daily Trend - Last 30 days").stdDevRange(2) |
The value of the calls_per_minute metric from the APM source in the last 30 minutes is not within the range of 2 standard deviations from the baseline value for the baseline configuration Daily Trend - Last 30 days. In this case, the metric expression is compared with the baseline value. | metrics("apm:calls_min", "apm")[timestamp > (now - 30m)].value() notBetween metrics("apm:calls_min", "apm").baseline("Daily Trend - Last 30 days").stdDevRange(2) |
The 50th percentile value of the calls_per_minute metric from the APM source is greater than 10 more than 15 times in the last 30 minutes. | metrics("apm:calls_min", "apm")[timestamp > (now - 30m) && percentile(50) > 10].count() > 15 |
The count of events of the type k8s:native_event from the source infra-agent with severity "Severe" and description "Pod Restarted" is more than 100 in the last 30 minutes. | events("k8s:native_event")[source = 'infra-agent' && attributes(severity) = 'SEVERE' && attributes(description) = 'Pod Restarted'][timestamp > (now - 30m)] > 100 |
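The count-based expressions combine a time-window filter, a value filter, and a count comparison. The following Python sketch mirrors the second row of the table (value greater than 10, more than 16 times in the last 30 minutes) over in-memory samples. The function name and the sample data are hypothetical, and the sketch says nothing about how the platform schedules or stores measurements.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple

Sample = Tuple[datetime, float]  # (timestamp, metric value)


def count_condition(samples: Iterable[Sample], now: datetime, window: timedelta,
                    value_threshold: float, count_threshold: int) -> bool:
    """Mirror of [timestamp > (now - window) && value > v].count() > n."""
    cutoff = now - window
    matching = sum(
        1 for timestamp, value in samples
        if timestamp > cutoff and value > value_threshold
    )
    return matching > count_threshold


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# One hypothetical sample per minute: 17 readings above 10, then 13 below.
samples = [(now - timedelta(minutes=i), 20.0 if i < 17 else 5.0) for i in range(30)]
# A high reading from 45 minutes ago falls outside the window and is ignored.
samples.append((now - timedelta(minutes=45), 50.0))

violated = count_condition(samples, now, timedelta(minutes=30),
                           value_threshold=10, count_threshold=16)
```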