Royal Caribbean Case Study: How a dedicated full-stack observability team keeps Royal Caribbean’s technical systems running
Royal Caribbean identified observability experts like site reliability engineer Chris Ojeda to arm all its applications, web, and networking teams with complete, end-to-end visibility.
For Royal Caribbean to sail more than 65 of the largest cruise ships in the world across three brands, carrying more than 7.5 million passengers annually, while navigating to more than 240 locations around the globe, the organization must have specialized, highly skilled teams. These teams look like everything from world-class captains to an entire in-house production company for on-ship entertainment.
Perhaps one of the most specialized, highly skilled, and most relied-upon teams to ensure cruises can be booked, boarded, and enjoyed is one that lives dockside in Royal Caribbean’s Miami offices: the Royal Caribbean Observability team. While the team’s location remains on dry land, the observability team impacts every part of the Royal Caribbean cruise journey — from on-land to on-ship Wi-Fi — ensuring necessary applications are running smoothly and securely.
The story about why Royal Caribbean implemented full-stack observability is obvious: proactive response to potential issues decreases downtime and improves customer experience. However, it takes dedicated human-power and technology to shift an entire organization towards a proactive observability model.
This article will dive into the how: how Chris Ojeda, site reliability engineer, and the Royal Caribbean Observability team created an observability center of excellence, and what tools are empowering the team to do so.
How a Cisco AppDynamics pilot started Royal Caribbean’s observability journey
Ojeda’s role and the observability team came about after an initial success piloting AppDynamics for Royal Caribbean’s Cruise API product. Cisco AppDynamics helped the Cruise API team keep their applications running and secure, so customers could book cruises on Royal Caribbean’s site as well as on partner sites.
Step-by-Step: Building an observability team
Step 1: Determining the need for a dedicated observability team
The first step for Ojeda’s team was understanding the need for observability versus basic monitoring. Ojeda’s background as a technician prepared him to research and fix issues, but he was craving the ability to proactively prevent issues with visibility across the entire system.
Royal Caribbean is unique in that it supports two distinct sets of applications and networks. Critical applications and networks are onshore and relied upon for booking, payment, and typical business transactions. Yet each cruise ship houses its own data center, which communicates back to the onshore system and provides Wi-Fi and networking to all staff and guests on the ship, hundreds or thousands of miles away from the onshore system.
Before long, Ojeda and team realized the organization’s unique setup of applications and networks, along with its complex web of microservices, resulted in too much complexity for standard monitoring.
“Infrastructure monitoring just doesn’t cut it, so we started researching tools and agents to get full stack,” says Ojeda. “When I saw that, with AppDynamics, you could literally look at one screen and see where the problem is coming from, it kind of blew my mind. And, on top of that, you have ThousandEyes looking at the network level of the application. With observability, you can see everything.”
Ojeda shared these findings with his manager and discussed the tools available. The need for Ojeda to become an observability resource across the organization became as clear as the crystal blue Caribbean waters. With Director of IT Service Management Alice McElroy at the helm, the Royal Caribbean Observability team was formed, with a mission to become a center of excellence that all teams could lean on to keep their systems running.
“Teams look forward to having the observability team involved,” says Ojeda. “Whether it's a shipboard application or an onshore application, whether it's critical to customers or just an internal application, teams are reaching out to us to understand how their application is performing. And not only that, they also want to be alerted if something is going off the rails for whatever reason. And they want to be able to see it in as clean of a fashion as possible. Our team gives them this ability.”
Step 2: Learning, coding, and building
One of the most important drivers of the observability team’s success was choosing a product and leaning into learning. The team worked together to share knowledge about Cisco AppDynamics, and attended Cisco learning courses.
“My team members were probably the best resources I had, in addition to AppDynamics training,” says Ojeda. “When I first started, there was a good month where I was trying to absorb everything we were doing.”
Ojeda and team took advantage of the short learning curve to understand the basics of Cisco AppDynamics, and got the product up and running. This basic understanding was the building block, and they built from here.
“AppDynamics is not complex, but it can get complex if you want to do cool things, like make extensions to retrieve data the agents wouldn’t be able to pull by themselves,” says Ojeda.
One example of the team’s continuous learning around Cisco AppDynamics is understanding how to utilize the release of the new smart agent, which will help with deploying and upgrading applications.
The agent is still in the test environment, but the team has already realized the time to deploy or upgrade will be reduced from around a month to a few weeks.
Utilizing Developer resources
The team is also experimenting with custom extensions to pull data from PagerDuty.
“We have some systems that are very legacy and can’t get an agent installed,” says Ojeda. “So, on our AppDynamics scorecard, the team built a custom extension to pull data from PagerDuty. If there was an incident created by another tool, AppDynamics will pick it up and display it on the scorecard as an issue.”
“These are the cool things that we can do, playing around with APIs and custom extensions,” says Ojeda. “We're not limited to just what AppDynamics can do. We have some flexibility to bring some cool programming ideas in and bring some data that we wouldn't have happened upon any other way.”
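The PagerDuty extension Ojeda describes could be sketched roughly as follows. This is an illustration only, not Royal Caribbean's code: the function names, service names, and metric path are hypothetical, and the sample dictionaries only mimic the `status` and `service.summary` fields of a PagerDuty incident. The sketch counts open incidents per service and formats each count as a metric line of the kind a script-based agent extension might print to standard output. The `name=...,value=...` line format is an assumption here and should be checked against the AppDynamics documentation.

```python
from collections import Counter

# Statuses treated as "still open"; resolved incidents are ignored.
OPEN_STATUSES = {"triggered", "acknowledged"}


def open_incident_counts(incidents, services):
    """Return {service name: number of open incidents} for every listed service."""
    counts = Counter(
        incident["service"]["summary"]
        for incident in incidents
        if incident["status"] in OPEN_STATUSES
    )
    # Services with no open incidents are reported explicitly as 0.
    return {name: counts.get(name, 0) for name in services}


def to_metric_lines(counts, prefix="Custom Metrics|PagerDuty"):
    """Format counts as 'name=<path>,value=<n>' lines (assumed extension format)."""
    return [
        f"name={prefix}|{name}|Open Incidents,value={value}"
        for name, value in sorted(counts.items())
    ]


if __name__ == "__main__":
    # Hypothetical sample data; a real extension would fetch incidents from PagerDuty.
    sample = [
        {"status": "triggered", "service": {"summary": "Legacy Billing"}},
        {"status": "resolved", "service": {"summary": "Legacy Billing"}},
        {"status": "acknowledged", "service": {"summary": "Port Check-in"}},
    ]
    services = ["Legacy Billing", "Port Check-in", "Crew Portal"]
    for line in to_metric_lines(open_incident_counts(sample, services)):
        print(line)
```

Keeping the counting and formatting as pure functions means the logic can be tested without a PagerDuty account; only the fetch step would need credentials.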
When the team has questions, they lean on the Cisco account team and DevNet documentation. “I was tasked to learn about AppDynamics and observability at the same time, and the learning curve was pretty simple,” says Ojeda. “I used a lot of documentation, and a lot of support from the account team.”
Step 3: Universal adoption of the team’s services
Once the observability team was educated and confident using AppDynamics, the team began offering its services to application teams across the organization. Internal adoption of the team’s services was a key tenet of the team’s success. It didn’t take long for the team to realize the strong appetite for their observability expertise.
“We started receiving emails from teams that would say ‘hey, I’ve heard of you guys, and I’m building this application,’” says Ojeda. “They were hoping to have monitoring and observability built in as they developed their application.”
Soon, what was word-of-mouth became the standard for all application teams. Ojeda and team began weaving observability into all applications from their inception. When a new application is built at Royal Caribbean, “the workflow requires at least a bare minimum of observability with opportunity for expanded extensions.” Ojeda hosts discovery sessions with application teams to provide customized observability solutions for each application and ensure observability is integrated into every solution.
“Now applications should be observed from beginning to end, even in the test environments and development environments,” says Ojeda.
Using observability to plan for the future
The observability team has already realized the benefits of reduced application downtime and time to resolution, which we will look at below. But the team also uses observability data to strengthen new systems as they are being built.
“Even for legacy systems where observability is more complicated and they lend themselves more to infrastructure monitoring, we’ve installed agents so we can measure what their capacity is,” says Ojeda. “We use this data to plan for future systems, so we can build those stronger, with better memory. It’s a great fit for upgrades.”
Benefits of moving to full-stack observability for SREs and Application Developers at Royal Caribbean
The investment in the observability team took time and resources. However, the return on this investment has already been realized.
“In the long run, the observability team will save money, manpower, and hours that would have been spent putting out fires,” says Ojeda. “Yes, it’s an investment. It may be daunting, it may be overwhelming to start. But, once you get the ball rolling and you start observing applications, you’re golden — you immediately start seeing the return on this investment.”
Royal Caribbean has already realized the following benefits as a result of its observability team.
Benefit 1: Mitigated tool sprawl
A dedicated observability team protects against tool sprawl, a common challenge for modern IT teams. Before Ojeda joined and the dedicated observability team was created, Royal Caribbean IT used a variety of tools to monitor individual applications. This caused a level of disorganization with teams using different monitoring methods across the business.
These days, Ojeda and the team have centralized the toolbox, decreasing complexity and providing a standard across the organization.
“We have our own toolboxes that help the teams achieve full stack observability. In the toolbox is Cisco tools like AppDynamics, ThousandEyes, and now Cisco Cloud Observability.”
Benefit 2: Targeted, efficient approach to troubleshooting
In the past, when a major incident occurred, the entire organization would be called in, and every tool was inspected to understand where the problem might have originated.
“When you send a PagerDuty alert to the entire organization, you inadvertently create a war room,” says Ojeda.
War rooms create a variety of challenges for an IT organization that is working to get the system up and running as quickly as possible:
- Challenge 1: The time it takes to get people into the war room
- Challenge 2: War rooms are resource hogs — they require attendance by individuals who may not be needed and distract from other priorities
“It was a bulky process,” says Ojeda. “Everything was divided and split up, and when everybody got together, it was time consuming. The time to resolution was much larger.”
Now, with full-stack observability, the team still holds war rooms, but with much smaller teams. Ojeda can look at the entire application on a screen, and check the health rules that have been configured. From this information, he is able to call in the relevant teams to join the war room, and find resolution much faster.
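The idea of calling in only the relevant teams can be illustrated with a small sketch. Everything in it is hypothetical: the violation records, the ownership map, and the team names are invented for illustration and do not reflect AppDynamics data structures or Royal Caribbean's setup.

```python
def teams_to_page(violations, owners, fallback="observability"):
    """Return the sorted list of teams that own a component with a critical violation."""
    teams = set()
    for violation in violations:
        if violation["severity"] != "critical":
            continue  # warnings are reviewed later and do not trigger a page
        # If ownership is unknown, route to the fallback team instead of everyone.
        teams.add(owners.get(violation["component"], fallback))
    return sorted(teams)


# Hypothetical example: two components are in trouble, one only has a warning.
violations = [
    {"component": "booking-api", "severity": "critical"},
    {"component": "ship-wifi", "severity": "warning"},
    {"component": "payments-db", "severity": "critical"},
]
owners = {"booking-api": "cruise-api", "payments-db": "payments", "ship-wifi": "network"}
print(teams_to_page(violations, owners))
```

The point of the design is that the page goes to the owners of the failing components rather than to the whole organization.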
Benefit 3: Real-time, automated application health
Prior to the observability team, Ojeda and team would attend daily calls at 2 p.m. to review system health. The process was batched, manual, and time-consuming, with teams updating a color-coded Excel sheet to track application health.
“Our system health was monitored on a multi-colored Excel sheet,” says Ojeda. “Everyone from every team joined the call and would report on their system.”
Green meant their system was good; yellow, potentially overloaded; red, the team was in trouble. One of Ojeda’s projects upon adopting observability was to transfer this Excel document to a scorecard on AppDynamics. All applications and systems were configured into the scorecard. Now Ojeda and team can look down the scorecard in real time and see whether the entire system is healthy.
“There is no need to go and ask the team what the status of their system is,” says Ojeda. “It saves a lot of time to be able to look at a green circle in AppDynamics and say ‘your system is good’ or see a yellow or red circle and dig into that specific area and ask ‘what is going on here?’”
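The green, yellow, and red scheme described above amounts to a worst-status rollup, which can be sketched in a few lines. This is a hypothetical model, not the AppDynamics scorecard itself: the `load_pct` input and the 75 and 90 percent thresholds are assumptions chosen for illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    GREEN = 0   # system is good
    YELLOW = 1  # potentially overloaded
    RED = 2     # team is in trouble


def classify(load_pct, warn_at=75.0, critical_at=90.0):
    """Map a load percentage to a status using hypothetical thresholds."""
    if load_pct >= critical_at:
        return Health.RED
    if load_pct >= warn_at:
        return Health.YELLOW
    return Health.GREEN


def overall(scorecard):
    """The overall status is the worst status of any application."""
    return max(scorecard.values(), default=Health.GREEN)


def needs_attention(scorecard):
    """Names of applications that are not green, in alphabetical order."""
    return sorted(name for name, status in scorecard.items() if status != Health.GREEN)


# Hypothetical applications and load figures.
scorecard = {"booking": classify(40), "ship-wifi": classify(80), "payments": classify(95)}
print(overall(scorecard).name, needs_attention(scorecard))
```

Ordering the statuses numerically lets a single `max` answer the question the 2 p.m. call used to answer, while `needs_attention` points to where to dig in.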
Benefit 4: Improved team productivity
Before Royal Caribbean, in Ojeda’s prior life as a technician, putting out fires was the norm. He would receive a critical ticket, sometimes at 3 in the morning, and work to resolve it right away. These days, with a proactive observability model, Ojeda and team are able to prioritize and plan, resulting in reduced stress and increased ability to spend time learning.
Benefit 5: Improved metrics
As is common across application teams, metrics vary from team to team because of distinct service level agreements and the type of performance monitored on each application. Still, since implementing ThousandEyes and AppDynamics, Ojeda has witnessed an improvement in network performance, business transaction performance, and microservice performance.
“We provide specific metrics for each team to analyze how their system is running,” says Ojeda. “Some applications need to be running 24/7, so the metric we look at there is: did the application stop running? On the aggregate, though, we have seen an improvement in business transaction performance and network metrics across all teams.”
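For an application that must run around the clock, the "did it stop running?" question is usually expressed as an availability percentage checked against a service level target. The sketch below shows the arithmetic; the function names and the sample figures are hypothetical and are not Royal Caribbean's numbers.

```python
def availability_pct(total_minutes, downtime_minutes):
    """Percentage of the period during which the application was running."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def meets_target(total_minutes, downtime_minutes, target_pct):
    """True when measured availability is at or above the agreed target."""
    return availability_pct(total_minutes, downtime_minutes) >= target_pct


# Hypothetical example: a 30-day month has 30 * 24 * 60 = 43,200 minutes.
month = 30 * 24 * 60
print(availability_pct(month, 432))      # 432 minutes down out of 43,200
print(meets_target(month, 432, 99.9))    # compared against a 99.9% target
```

Because each team has its own service level agreement, the target is passed in as a parameter rather than fixed in the code.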
Benefit 6: Improved mitigation of ship or onshore network issues
The complexity that came from ship and onshore networks is still there — but Ojeda’s team has been able to make sense of it. Observability has allowed the team to pinpoint issues and quickly determine what is going on.
“We actually had an issue yesterday with one of the clusters on one of the ships,” says Ojeda. “Each ship is its own data center, and each ship needs its own individual attention. The point clusters for the OpenShift hybrid cloud based systems had an issue on one of the ships and I was able to use ThousandEyes to identify it and start working on it right away.”
Royal Caribbean’s evolved plans for observability
Ojeda and team will continue iterating their observability solution by adding visibility to every layer.
“We’re in a really good place right now. When an application team comes to us and requests full-stack observability, we can implement that pretty quickly, within days,” says Ojeda. “My focus this year will be to mature our cloud native side with Cisco Cloud Observability, as well as continue to train on the platform.”
With any development team, change is the only constant. But with a dedicated observability team that can proactively mitigate issues and free up time for planning and research, Royal Caribbean is able to stay best-in-class and continue bringing customer teams to life at sea.