From the course: Mastering Observability with OpenTelemetry
A brief history of monitoring and observability - OpenTelemetry Tutorial
- [Instructor] Before we dive into the practical application of observability, it makes sense to understand why it exists and how it developed. Until around 2010, applications were mostly monoliths. The graphic shows a typical multi-tier application with a presentation layer, a business logic layer, and a data access layer, all in one process. Of course, these applications had errors, performance problems, and bugs too. Common questions back then were: why is my application slow? What caused this Error 500? Is my hardware well-utilized? That last one mattered because these monoliths often ran on dedicated machines or virtual machines, and it was a cost factor if they were over-provisioned or under-provisioned. Consequently, people came up with ways to monitor monoliths. They collected metrics like memory and CPU usage, but also response time. They collected logs to find errors fast. For instance, an exception in a log file can give you a lot of hints about the problem. And they collected profiles of memory and CPU utilization. There were also first attempts to measure the execution of functions, which was an early version of so-called tracing.

Fast forward to 2010 and onwards: we saw a shift towards something called microservices, and this is still the architecture of choice today. Instead of monoliths where everything was running in one process, we now have independent, small applications with a limited set of concerns. Additionally, managed cloud services became popular. Think of DynamoDB, AWS Lambda, Google Cloud Functions, and the like. So now, instead of one application, we have many. And to make things more complex, the arrows that you see here represent the communication between these services, and this communication now goes over the network, be it through REST, gRPC, queues, or GraphQL. People in charge of such systems now have a set of new questions, like: why is my application slow, and which service is the root cause? This is similar to what we had with monoliths, but things get more complicated from there. Which service and which operation caused this Error 500? It's harder now to know where an error that bubbles up to the user really originated. Which services are there, and how many? Microservice architectures can grow big and can be managed by multiple teams, so it's easy to lose oversight. Is there a problem with a third-party service? Insight into third-party services, like a serverless function, is often limited, so how can I find out if it's working correctly? And, similar to what we saw before: did I reserve enough resources, like memory or CPU, for my services? In a containerized world, maybe orchestrated by Kubernetes, I have to know if the resources allocated to my services meet their needs.

But let's look at a practical example. Say you have an e-commerce application. When a user clicks on checkout, a middleware is called, which calls the cart service, and then a currency service running as a serverless function, which in turn calls a third-party service to do the real-time currency conversion. Let's say you monitor this checkout operation and increase a counter for each successful and each failed operation.
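To make that counter concrete, here is a minimal sketch of how it could be recorded with the OpenTelemetry metrics API in Python. The meter name, the `process_checkout` helper, and the `outcome` attribute are illustrative assumptions, not something defined in the example above.

```python
from opentelemetry import metrics

# Obtain a meter from the globally configured MeterProvider.
# Without an SDK and exporter configured, these API calls are no-ops.
meter = metrics.get_meter("checkout-middleware")

# Counter for checkout operations, labeled by outcome (success/failure).
checkout_counter = meter.create_counter(
    "checkout.operations",
    description="Number of checkout operations by outcome",
)

def process_checkout(order):
    """Placeholder for the real checkout logic (cart, currency conversion, ...)."""
    if not order.get("items"):
        raise ValueError("empty cart")

def handle_checkout(order):
    """Hypothetical handler that counts each checkout as a success or failure."""
    try:
        process_checkout(order)
        checkout_counter.add(1, {"outcome": "success"})
    except Exception:
        checkout_counter.add(1, {"outcome": "failure"})
        raise
```

A chart of this counter, split by the `outcome` attribute, is exactly what lets you spot the failure rate described next. What it cannot tell you is which downstream service caused those failures.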
On a Monday, you look at the chart and you can see that roughly 10% of the checkout operations failed. This is what metrics give you. Now, to find the root cause, you maybe go through the logs; hopefully, all services send them to a central place. Maybe the logs give you a hint where the problem comes from. If not, you may try to run the services in a development environment and start debugging. But what if you can't reproduce the problem? Hours, maybe days pass. Revenue is 10% down and your boss is already breathing down your neck. There has to be a better way. And there is: it's called distributed tracing, and we will discuss it in the next video.