Platform Engineering 09 - Observability for Platform Engineers
Monitoring vs Observability
Monitoring and observability may look similar, but they answer fundamentally different questions. Monitoring focuses on determining whether the system is up, while observability focuses on understanding why the system is behaving the way it is.
| | Monitoring | Observability |
|---|---|---|
| Question | "Is it up?" | "Why is it doing this?" |
| Approach | Predefined checks | Explore arbitrary questions |
| Best for | Known failure modes | Unknown failure modes |
Monitoring tells you that something is wrong. Observability, on the other hand, lets you trace and understand specifically what went wrong and why. For simple systems with predictable failure patterns, monitoring alone may suffice. But in microservice environments where failures emerge in unexpected ways, observability becomes essential.
The Three Pillars
Observability is built on three types of signals.
Logs record discrete events -- information such as "User 123 failed auth at 14:32:01, token expired." Metrics aggregate numerical data over time, taking forms like "p99 latency is 450ms" or "error rate is 2.3%," which are useful for gauging overall system health. Traces follow a single request's journey across multiple services -- revealing flows such as "hit service A, passed to service B, waited 800ms on the database."
When combined, these three signals take you from a vague symptom like "something is slow" to a specific cause like "this query in service B is slow for users in region X." Each signal alone provides only a partial picture, but together they enable a three-dimensional understanding of how the system actually behaves.
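As a toy illustration of how the three signals relate (hypothetical data, standard library only), the same raw request records can yield all three: a log line per event, aggregate metrics such as p99 latency and error rate, and a trace reconstructed by grouping records on a shared trace ID.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical request records; names and values are illustrative only.
# Each record carries a trace_id, so the three signals can be correlated.
records = [
    {"trace_id": "abc", "service": "A", "latency_ms": 12,  "error": False},
    {"trace_id": "abc", "service": "B", "latency_ms": 800, "error": False},
    {"trace_id": "def", "service": "A", "latency_ms": 15,  "error": True},
]

# Logs: one discrete event per record.
logs = [f"{r['service']}: {'ERROR' if r['error'] else 'OK'} in {r['latency_ms']}ms"
        for r in records]

# Metrics: numerical aggregates over all records.
latencies = [r["latency_ms"] for r in records]
p99_ms = quantiles(latencies, n=100)[98]          # approximate p99
error_rate = sum(r["error"] for r in records) / len(records)

# Traces: group spans by trace_id to reconstruct each request's journey.
traces = defaultdict(list)
for r in records:
    traces[r["trace_id"]].append((r["service"], r["latency_ms"]))

print(logs[0])        # a single discrete event
print(traces["abc"])  # [('A', 12), ('B', 800)], the slow hop in B is visible
```

Alone, the metric says "something is slow"; the trace pinpoints which hop spent the time, and the log supplies the event-level detail.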
OpenTelemetry
Before OpenTelemetry (OTel), every vendor used its own SDK, agent, and data format. Switching vendors in that landscape meant rewriting instrumentation code across every service. The deeper the integration, the worse the vendor lock-in became.
OTel solves this problem as a vendor-neutral standard. You instrument once and send data to whatever backend you choose.
App (OTel SDK) --> OTel Collector --> Backend (Grafana/Datadog/etc.)
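A minimal Collector configuration for this pipeline might look like the sketch below. The backend endpoint is a placeholder; real deployments typically add authentication, sampling, and resource processors.

```yaml
receivers:
  otlp:                  # accept OTLP data from app SDKs
    protocols:
      grpc:
      http:

processors:
  batch:                 # batch telemetry before export

exporters:
  otlphttp:
    endpoint: https://otel-backend.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Because only the exporter section points at a specific backend, switching vendors means changing this one file rather than re-instrumenting every service.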
When the platform team provides the Collector as infrastructure and pre-configures SDKs in golden path templates, every service gains observability out of the box without any additional effort from development teams.
Observability as a Service
What happens when you ask each team to set up their own logging, metrics, and tracing? Six months later, half of them have no traces at all, and the other half are sending meaningless data. This is an inevitable outcome when teams have varying skill levels and competing priorities.
The platform approach is fundamentally different. The platform team operates centralized collectors, storage, and dashboards, and configures auto-instrumentation through base images or sidecars. It also ships default dashboards that work immediately, so each team only needs to layer customizations specific to its own service on top of this foundation.
When a new service is deployed, it should have logs, metrics, and traces from day one without any setup. That is what it means to provide observability as a service.
SLIs, SLOs, and Alerting
No matter how rich your observability data is, without context it is just noise. The role of SLIs and SLOs is to give that data meaning.
An SLI (Service Level Indicator) is a specific metric that measures service health, defined in forms like "the percentage of requests that respond within 200ms." An SLO (Service Level Objective) defines the target for that SLI -- "99.5% of requests should respond within 200ms over a 30-day window." The platform team provides the tooling, and each team defines SLIs appropriate to their service, sets targets, and receives automatic alerting.
There is an important principle when configuring alerts: alert on symptoms, not causes. The correct approach is to alert on "error rate exceeds 1%" rather than "CPU usage exceeds 80%." High CPU usage that does not affect users does not warrant an urgent response. Furthermore, every alert must have a clear, actionable response for whoever receives it. An alert with no corresponding action is not an alert -- it is noise.
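The symptom-over-cause principle reduces to a trivial paging rule, sketched here with the thresholds from the text (real platforms evaluate this as alerting rules in their monitoring backend, not inline code):

```python
def should_page(error_rate: float, cpu_usage: float) -> bool:
    """Page on user-visible symptoms only.

    An error rate above 1% affects users and has a clear action:
    investigate the errors. High CPU alone is a cause, not a symptom,
    so it belongs on a dashboard, not in a page. Thresholds are
    illustrative.
    """
    return error_rate > 0.01  # cpu_usage is deliberately ignored for paging

print(should_page(error_rate=0.023, cpu_usage=0.55))  # True: users affected
print(should_page(error_rate=0.001, cpu_usage=0.95))  # False: busy but healthy
```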
Every service should come with a default dashboard showing request rate, error rate, and latency. Teams should be able to add custom panels on top without having to fork the entire dashboard.
Next Up
You cannot fix what you cannot see. When observability is provided as a default rather than an opt-in, the nature of incident response and debugging changes fundamentally.
In the next post, we cover security and governance baked into the golden path.