Cloud Native Best Practices #5: Observability & Monitoring

October 11, 2019

To quote Michael Dell, “the cloud isn’t a place, it’s a way of doing IT.” As IT becomes more and more central to what every company does, understanding cloud native best practices is key not only for the IT department but for every part of a business. This post is the fifth of a seven-part series examining how cloud native can help businesses deliver on their promise of better, faster, and cheaper. This part shows how observability and monitoring help provide a better level of service to customers through faster discovery of problems, all while identifying areas to reduce costs.

Cloud Native Best Business Practices: Observability and Monitoring

To keep up with the quickening pace of business, companies are increasingly turning towards cloud native technologies, with 70% of enterprises reporting that they are beginning to adopt or have already adopted them. Observability is a cornerstone of the cloud native technology definition because it helps companies monitor the health of their applications and infrastructure and optimize their resource consumption. Its importance also shows in project adoption: Prometheus, the open source monitoring system, became the second project, behind Kubernetes, to graduate from the Cloud Native Computing Foundation. Many companies implement Prometheus with Grafana for an open source monitoring stack or turn to one of the many vendors on the CNCF Landscape. The benefits of observability and monitoring extend far beyond a green status dashboard, directly to the bottom line.

As IT departments have grown from a few servers blinking in a backroom to sprawling landscapes of different deployments across multiple data centers and cloud providers, it has become increasingly difficult to even keep track of computing infrastructure, let alone optimize operations. In an always-on world where consumers and customers expect constant uptime, any impact on application performance is felt almost immediately by the end user, with IT downtime costing an average of $336,000 per hour. Clearly, being able to quickly, or even proactively, identify any IT problem not only saves companies money but also provides a better service to their customers.

Looking through the list of Kubernetes Failure Stories, almost every story begins with the operators noticing a degradation of performance in their monitoring system of choice. Monitoring dashboards come into play, at the latest, once debugging begins. When a production outage started, Grafana Labs was able to identify and mitigate the problem in under 10 minutes because they had proper observability in place, including both monitoring and logging. Observability is key not only to finding problems but also to solving them. Slamtec reduced the time spent on debugging and troubleshooting by 50% when they implemented centralized monitoring with Prometheus and Fluentd. Implementing monitoring and observability gives companies the ability to quickly spot and fix problems before they begin to have major impacts on customers and the bottom line.

However, observability and monitoring are not only about watching for things to go wrong; they can also help optimize when things are going well. Cloud native technologies provide a powerful platform to dynamically create infrastructure. In many companies, this ability is spread across a multi-tenant (and sometimes multi-cloud) environment. This can create a tragedy of the commons problem where teams overprovision resources to ensure their applications run smoothly. Today, companies waste $14.1 billion per year on idle and over-provisioned computing resources. By implementing monitoring to understand how effectively resources are being used, the team at Kubecost helped companies reduce their infrastructure spend by 30-70%, with one even realizing they were overspending by 500%! The inability to effectively track computing resources clearly impacts the bottom line in multiple ways.
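To make the overprovisioning point concrete, here is a minimal sketch of the kind of comparison that monitoring data makes possible: how much CPU a workload reserves versus how much it actually uses. Everything in it is an assumption for illustration only. The workload names, the usage figures, and the price per core-hour are hypothetical and do not come from Kubecost or any real cluster; in practice the numbers would come from a monitoring system such as Prometheus.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cpu_requested: float  # cores reserved for the workload (hypothetical value)
    cpu_used: float       # average cores actually consumed (hypothetical value)


def overprovision_report(workloads, price_per_core_hour):
    """Return (name, utilization, idle cost per hour) for each workload."""
    report = []
    for w in workloads:
        # Share of the reserved CPU that is really in use.
        utilization = w.cpu_used / w.cpu_requested if w.cpu_requested else 0.0
        # Reserved but unused cores are paid for without doing any work.
        idle_cores = max(w.cpu_requested - w.cpu_used, 0.0)
        report.append((w.name, utilization, idle_cores * price_per_core_hour))
    return report


# Hypothetical sample data; the price is an illustrative assumption.
workloads = [
    Workload("checkout", cpu_requested=8.0, cpu_used=2.0),
    Workload("search", cpu_requested=4.0, cpu_used=3.0),
]
for name, utilization, idle_cost in overprovision_report(workloads, price_per_core_hour=0.5):
    print(f"{name}: {utilization:.0%} utilized, ${idle_cost:.2f}/hour idle")
```

A report like this only surfaces candidates for rightsizing. Whether a low-utilization workload is truly overprovisioned still depends on its peak load and the headroom the team needs.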

We built Kubermatic Kubernetes Platform with an out-of-the-box integration of Prometheus and Grafana to give our customers the industry-standard open source tools for observability and monitoring. This gives our customers better insight into and oversight of their infrastructure. Using these tools and our Kubermatic Kubernetes Platform operators, we help our customers gain awareness of both the health and cost of their infrastructure. This allows them to monitor and fix problems before their customers notice and to optimize infrastructure costs, including internal chargebacks. You can read more about how we implemented it here. Observability and monitoring help cloud native companies create more reliable services with faster time to recovery while improving the bottom line through reduced outages and cost optimization.

Check out part six: security and compliance to understand the impact of cloud native upon the biggest concern for every enterprise.