Platform Engineering 09 - Observability for Platform Engineers
Monitoring vs Observability
Monitoring and observability may look similar, but they answer fundamentally different questions. Monitoring focuses on determining whether the system is up, while observability focuses on understanding why the system is behaving the way it is.
| | Monitoring | Observability |
|---|---|---|
| Question | "Is it up?" | "Why is it doing this?" |
| Approach | Predefined checks | Explore arbitrary questions |
| Best for | Known failure modes | Unknown failure modes |
Monitoring tells you that something is wrong. Observability, on the other hand, lets you trace and understand specifically what went wrong and why. For simple systems with predictable failure patterns, monitoring alone may suffice. But in microservice environments where failures emerge in unexpected ways, observability becomes essential.
The Three Pillars
Observability is built on three types of signals.
Logs record discrete events -- information such as "User 123 failed auth at 14:32:01, token expired." Metrics aggregate numerical data over time, taking forms like "p99 latency is 450ms" or "error rate is 2.3%," which are useful for gauging overall system health. Traces follow a single request's journey across multiple services -- revealing flows such as "hit service A, passed to service B, waited 800ms on the database."
When combined, these three signals take you from a vague symptom like "something is slow" to a specific cause like "this query in service B is slow for users in region X." Each signal alone provides only a partial picture, but together they enable a three-dimensional understanding of how the system actually behaves.
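As a toy illustration of how the three signals relate (hypothetical data, standard library only), the same raw request records can yield all three: a log line per event, aggregate metrics such as p99 latency and error rate, and a trace reconstructed by grouping records on a shared trace ID.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical request records; names and values are illustrative only.
# Each record carries a trace_id, so the three signals can be correlated.
records = [
    {"trace_id": "abc", "service": "A", "latency_ms": 12,  "error": False},
    {"trace_id": "abc", "service": "B", "latency_ms": 800, "error": False},
    {"trace_id": "def", "service": "A", "latency_ms": 15,  "error": True},
]

# Logs: one discrete event per record.
logs = [f"{r['service']}: {'ERROR' if r['error'] else 'OK'} in {r['latency_ms']}ms"
        for r in records]

# Metrics: numerical aggregates over all records.
latencies = [r["latency_ms"] for r in records]
p99_ms = quantiles(latencies, n=100)[98]          # approximate p99
error_rate = sum(r["error"] for r in records) / len(records)

# Traces: group spans by trace_id to reconstruct each request's journey.
traces = defaultdict(list)
for r in records:
    traces[r["trace_id"]].append((r["service"], r["latency_ms"]))

print(logs[0])        # a single discrete event
print(traces["abc"])  # [('A', 12), ('B', 800)], the slow hop in B is visible
```

Alone, the metric says "something is slow"; the trace pinpoints which hop spent the time, and the log supplies the event-level detail.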
OpenTelemetry
Before OpenTelemetry (OTel), every vendor used its own SDK, agent, and data format. Switching vendors in that landscape meant rewriting instrumentation code across every service. The deeper the integration, the worse the vendor lock-in became.
OTel solves this problem as a vendor-neutral standard. You instrument once and send data to whatever backend you choose.
App (OTel SDK) --> OTel Collector --> Backend (Grafana/Datadog/etc.)
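A minimal Collector configuration for this pipeline might look like the sketch below. The backend endpoint is a placeholder; real deployments typically add authentication, sampling, and resource processors.

```yaml
receivers:
  otlp:                  # accept OTLP data from app SDKs
    protocols:
      grpc:
      http:

processors:
  batch:                 # batch telemetry before export

exporters:
  otlphttp:
    endpoint: https://otel-backend.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Because only the exporter section points at a specific backend, switching vendors means changing this one file rather than re-instrumenting every service.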
When the platform team provides the Collector as infrastructure and pre-configures SDKs in golden path templates, every service gains observability out of the box without any additional effort from development teams.
Observability as a Service
What happens when you ask each team to set up their own logging, metrics, and tracing? Six months later, half of them have no traces at all, and the other half are sending meaningless data. This is an inevitable outcome when teams have varying skill levels and competing priorities.
The platform approach is fundamentally different. The platform team operates centralized collectors, storage, and dashboards, and configures auto-instrumentation through base images or sidecars. It also ships default dashboards that work immediately, so each team only needs to layer customizations specific to its own service on top of this foundation.
When a new service is deployed, it should have logs, metrics, and traces from day one without any setup. That is what it means to provide observability as a service.
SLIs, SLOs, and Alerting
No matter how rich your observability data is, without context it is just noise. The role of SLIs and SLOs is to give that data meaning.
An SLI (Service Level Indicator) is a specific metric that measures service health, defined in forms like "the percentage of requests that respond within 200ms." An SLO (Service Level Objective) defines the target for that SLI -- "99.5% of requests should respond within 200ms over a 30-day window." The platform team provides the tooling, and each team defines SLIs appropriate to their service, sets targets, and receives automatic alerting.
There is an important principle when configuring alerts: alert on symptoms, not causes. The correct approach is to alert on "error rate exceeds 1%" rather than "CPU usage exceeds 80%." High CPU usage that does not affect users does not warrant an urgent response. Furthermore, every alert must have a clear, actionable response for whoever receives it. An alert with no corresponding action is not an alert -- it is noise.
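The symptom-over-cause principle reduces to a trivial paging rule, sketched here with the thresholds from the text (real platforms evaluate this as alerting rules in their monitoring backend, not inline code):

```python
def should_page(error_rate: float, cpu_usage: float) -> bool:
    """Page on user-visible symptoms only.

    An error rate above 1% affects users and has a clear action:
    investigate the errors. High CPU alone is a cause, not a symptom,
    so it belongs on a dashboard, not in a page. Thresholds are
    illustrative.
    """
    return error_rate > 0.01  # cpu_usage is deliberately ignored for paging

print(should_page(error_rate=0.023, cpu_usage=0.55))  # True: users affected
print(should_page(error_rate=0.001, cpu_usage=0.95))  # False: busy but healthy
```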
Every service should come with a default dashboard showing request rate, error rate, and latency. Teams should be able to add custom panels on top without having to fork the entire dashboard.
Next Up
You cannot fix what you cannot see. When observability is provided as a default rather than an opt-in, the nature of incident response and debugging changes fundamentally.
In the next post, we cover security and governance baked into the golden path.