How to improve system and application health in cloud-native environments

The key to proactive observability are regular, well-defined application health checks

Estimated time to read: 11 minutes

The need to understand system and application health has remained a top priority for IT and DevOps teams throughout the past decade. More recently, the cloud environment has complicated the methods used by developers to understand system and application health. Simple monitoring once was the method of choice — before microservices and multi-cloud environments entered the picture. Now, monitoring multiple systems only further complicates an already complicated system.

To make sense of this complexity, track application health, and glean actionable insights to improve systems, developers have been turning to full-stack observability.

In this article, we’ll take a look at how cloud-native environments have changed the way developers monitor their systems and dive into the charge leading the era — full-stack observability — and explore how it can help developers quickly respond to system degradation and failures.

Challenges of monitoring cloud-native applications

Cloud-native applications have gained popularity due to a number of features that are important in todays world. However, the same features that allow cloud-native applications to be dynamic and scalable make them more complicated to monitor using traditional monitoring methods. This section will explore how the features that make cloud-native applications popular have also challenged monitoring.

Challenge 1: Dynamic and scalable applications

Immutable infrastructure is a fundamental feature of cloud-native applications. The idea behind immutable infrastructure is, when a resource is obsolete or even malfunctioning, you replace it completely, instead of updating it in place. Because of this, cloud-native applications (like those orchestrated by Kubernetes) have the following dynamic features:

Applications run in containers with short lifespans
Traffic is shaped dynamically using service meshes
Systems are scaled intelligently

All of this is in stark contrast to traditional applications deployed on fixed, known hardware. Monitoring is complicated by this fluidity.

Challenge 2: Highly distributed systems

Cloud-native applications are extremely microservices-oriented. They run on different platforms and produce different log and metric formats in order to improve overall agility and maintainability. However, this introduces challenges in standardization. These complex interactions and dependencies make it difficult to get a firm grasp on understanding your system. As an added challenge, services can be spread across locations and cloud providers. This highly distributed nature requires a much more complex approach to understanding the health of the system. Different services are often maintained by separate teams, which makes propagating the right amount of context between teams difficult.

Quick win to overcome this challenge: We’ll talk about the solutions to these challenges later in the article, but a quick tip to reduce this complexity: These complex interactions between services can be managed by using service meshes. Service meshes provide service discovery, load balancing, traffic shaping, failure recovery, and policy enforcement. Observability in cloud-native environments can integrate with service meshes to offer a complete and coherent picture of the system.

Challenge 3: Rapid evolution, automation, and CI/CD processes

There is an emphasis on automation and Continuous Integration/Continuous Deployment (CI/CD) processes with cloud-native apps. While this rapid evolution offers benefits in terms of speed, reliability, and efficiency, it has implications for application monitoring. Monitoring often can’t keep up with the speed of changes going to production and rolling deployments. Further, these changes produce a large volume of non-uniform data that traditional monitoring systems can’t anticipate and diagnose new types of failures coming from updated systems.

Challenge 4: Unknown failure modes

On a similar note, traditional monitoring approaches expect failure in a clearly understood and defined way. However, the new features outlined above introduce new failure modes that are unknown and unpredictable. New kinds of failures take longer to detect and fix.

Quick win to overcome this challenge: Cloud observability platforms can surface issues and relevant information without being configured for specific problems. Instead of taking a reactive approach, i.e., monitoring for failures that are already known, cloud systems demand a proactive approach where tooling can detect and report deviations from desired behavior and present it in an actionable way.

Solutions to improve system and application health

The following section will dive into how developers can battle the complexity explored above, and improve their system and application health through setting SMART goals, practicing visual data analysis, and, of course, moving towards an observability-first, proactive approach.

Solution 1: Set clear observability goals

A good monitoring approach is grounded in business goals. Plain systems monitoring is a thing of the past. With cloud-native applications, the complexity and fluidity of the architecture means that it’s simply not feasible to figure out all failure modes ahead of time.

This is why the practice of site reliability engineering (SRE) embraces failure. Simply witnessing failures is useless without the context of end-user impact. For example, if certain clustered web application servers fail in a cloud-native system that typically supports auto-healing, you don’t have to take any action. If certain errors happen within acceptable thresholds (error budgets), engineering time is better spent elsewhere.

To understand what to monitor, a developer needs to determine how services are being delivered to the end users, and the business impact of these services. Observing digital touch points with customers should be associated with the applications that power them. Similarly, observing the applications themselves should be associated with the infrastructure and platforms they run on.

Set “SMART” goals

You have to know what you want to achieve to figure out what to measure. The SMART framework lets you come up with goals that are Specific, Measurable, Attainable, Relevant, and Time-Bound.

Your observability practice should allow you to associate measurable business KPIs or metrics with your application performance and the platforms and infrastructure that run it.

Solution 2: Implement full-stack observability

Teams need a holistic understanding of their system health by observing and correlating the different layers of their IT stack and business goals. Unfortunately, these various layers often use disjointed tools.

Let’s take an example of a large e-commerce store selling products in multiple cities and consisting of various teams (e.g., engineering, business analysts, and leadership), where degraded infrastructure is ultimately leading to a drop in sales. The following table shows how different teams of an organization focus on their specific views of the system, leading them to different issues they deem important.

Category	Example
Business KPI Monitoring	Sales trend Decreased checkouts Increased abandoned carts Reduced up-sell recommendation conversions Slightly lower than usual success rate on a particular payment gateway
Application Performance Monitoring	High error rate for cart service High latency for payments web service Slow queries reported out of payments database High error rate for recommendations API due to requests timing out
CI/CD Dashboard	New version of payments service deployed 45 minutes ago New version of catalog API deployed 60 minutes ago
Real User Monitoring	12% of users from a particular region affected by broken checkout functionality in the past hour
Synthetic Monitoring	Everything looks fine
Error Tracking	Mobile App: Checkout page broken due to API Cart API: Checkout failed due to payment service failure Payments API: Database connection timed out while getting wallet balance for a particular database shard
Infra Monitoring	Network packet drops detected in zone where database servers are running. High CPU utilization for containers running the recommendations engine

If your biggest contributor to lower sales is abandoned carts due to the busiest city being served by a data center where there are packet drops caused by a misconfiguration in the networking stack. It’s difficult to prioritize fixes for an issue causing the biggest business impact between different layers of the IT architecture. Traditional approaches are siloed. Approaches that don’t contextualize and correlate the telemetry data they collect also fail to gauge the overall impact of cascading effects throughout the system.

All of this leads to competing ideas about how to address problems, which makes prioritizing solutions difficult. Having full-stack observability enables teams to communicate, agree on the root causes, and align on common goals thereby improving cross-functional collaboration.

Full-stack observability provides the ability to determine a system’s health across all components of the technology stack.

Remediation and collaboration are simple when observability consolidates all the siloed information from different tools into one platform.

Solution 3: Practice effective data analysis using visualization and reporting

Understanding user needs is key to designing dashboards and reports that deliver users the information they need in an accessible format. This user-centric approach minimizes cognitive load and maximizes comprehension.

Visualization can be a powerful tool in this process. Here are some examples of various forms of visualization.

Tables

Tables provide detailed values of metrics with high precision where current values are more relevant than historical values or trends.

CPU and mempry utilization table
Figure 1: Table visualization showing data about pods and their CPU and memory utilization (Source: Cisco)

Graphs

Graphs are one of the most popular visualization techniques in full-stack observability. They’re good for showing data trends over time and also help in correlating multiple metrics.

Kafka service graph
Figure 2: Graph visualization showing GC time and heap size of the Kafka service. (Source: Cisco)

Flame graphs

Flame graphs, made popular by Brendan Gregg, help visualize hierarchical data and are widely used to uncover performance hotspots.

Kubernetes Metrics Server Flame graph
Figure 3: Flame graph of the Kubernetes Metrics Server showing time spent by each function call in the stack trace (Source: Cisco)

Counter and Gauge Charts

A counter or gauge chart shows the current value of a given metric, along with capacity or a higher limit. A good example of this is for showing CPU utilization or application health.

Figure 4: Application health shown on a gauge chart (Source: Cisco)

Geographical Charts

Geographical charts show spatial data and distribution across locations. An example is the number of active users in different parts of the world.

Stacked charts

An extension of a graphical visualization, a stacked chart helps show trends of multiple items on the same axis, as well as their individual share and accumulated value.

Figure 5: Geographical and stacked charts on Cisco ThousandEyes dashboard (Source: Cisco)

Service maps and topology

A service map is useful for displaying services and their relationship with each other. It also shows additional metrics such as traffic or latency and hierarchy. This is particularly useful in the distributed systems and microservices paradigm.

Kubernetes services

Figure 7: Service maps showing various Kubernetes services and their relation with each other. (Source Cisco)

Visibility, insights, and actions

It is imperative for an observability platform to be able to provide visibility, insights based on this visibility, and recommend actions. To do so, the platform must correlate traces with their corresponding logs and metrics at a particular point in time. Moving beyond siloed metrics, a holistic approach to monitoring enables proactive identification of cascading issues across systems. Logs record events deemed important to the business or software, adding context to the metrics you’re seeing. Traces track the end-to-end behavior of a distributed system, offering insights into what is happening to a request at any point in the application. When combined with metrics and logs, traces can provide a fuller understanding of performance or bugs you’re trying to remediate.

Organizations can also leverage AI and data science to report issues proactively via anomaly detection. Intelligent correlation searches can surface issues across seemingly disparate components. This is especially useful in highly distributed and fast-changing applications where unforeseen problems are likely to arise.

Maintain security and compliance

Another important dimension of system health is security and compliance adherence. Observability platforms can monitor for vulnerabilities, threats, attacks, and misconfigurations in real time to trigger alerts and automated remediation workflows.

Compliance monitoring can oversee the status of various cloud-native assets like security groups, storage buckets, and load balancers, identifying any misconfigurations or violations of established policies.

Platforms can also have popular compliance frameworks such as PCI DSS and CIS benchmarks built in, which facilitate mandatory compliance when aided by application and infrastructure monitoring.

Tips to implement full-stack observability and improve system and application health

An observability platform has to turn all of this information into valuable insights for various teams and provide context that improves the understanding of the system end to end. The following tips will help you get the most out of your observability platform.

1. Implement standardization

When you’re running microservices across multiple teams and platforms, it’s difficult to keep up with your observability needs. Standardizing log formats, metrics emission, and collection, along with templatizing dashboards, reduces the effort required to set up observability for new services; it also promotes knowledge sharing.

2. Practice efficient alerting and incident management

Make sure to prioritize alerts and set up an escalation policy. And be wary of overly aggressive alerting — if everything is urgent, nothing is urgent.

Proactively adjust thresholds so that alerts are actionable. Being continually exposed to alerts where no action is needed desensitizes us to them. This is called alert fatigue and leads to missed alerts or even burnout in folks experiencing it.

Alerts should also contain information that gives responders context and helps them come up with a potential fix faster. This could be the right people to reach out to, incident response runbooks, and other documentation.

3. Adopt continuous learning and improvement

Observability should be used to implement a continuous review and improvement process for system reliability, performance, and efficiency. This allows you to make data-driven decisions around what the most pressing concerns are and where efforts are most likely to yield the best results.

For every issue that escapes into production, take the lesson learned and improve detection and remediation. Each failure should make your system more reliable.

4. Pick the right tools

There are multiple dimensions to picking the best tools based on your constraints and goals. You need a platform that enables you to be goal-oriented and implement full-stack observability.

Be wary of vendor lock-in and prioritize tools that support standard protocols like OpenTelemetry. These enjoy broader support and are also more extensible and customizable.

Another productive choice is choosing a platform that has good out-of-the-box integrations and auto-instrumentation support for your particular technology stack.

Modern platforms also recognize the scale at which cloud-native systems operate, offer data-tiering and intelligent sampling, and leverage new technologies like eBPF for better cost efficiency.

In today's environment, observability is the key to application health

By adopting the right practices, mindset, and tooling, organizations can excel in the demanding landscape of cloud-native systems, properly observing and maintaining good system health.

The industry understands the challenges of implementing effective cloud observability and has recognized platforms like Cisco Observability Platform and that are designed specifically to address them.

Good luck in your cloud-native transformation journey!

Additional Resources and Links:

Article: What are the Golden Signals?
Hub: Cisco Developer Full-Stack Observability Hub
EBook: Cisco Full-Stack Observability
API Tutorial: Health Rules with Cisco Cloud Observability
Sample Code: Use health rules to monitor entity health with Cisco Cloud Observability
API Doc: Cisco Observability Platform Health Rules API