Through the lens of an Engineering Leader

Most engineers think of observability as logs, dashboards, and alerts, and these are an important part of the puzzle. But as a leader, your role is to ensure your teams aren’t just reacting to problems with these insights. More importantly, they should be building systems that are transparent, measurable, and reliable from day one.

Whether it's catching performance issues before they trigger customer complaints, or giving product managers insight into how features are used (or not used), observability is your frontline defense against surprises in production. It is also how you verify whether there actually is an issue.

This topic overlaps with my CAPS strategy post, so I would advise reading the two together to make sure you're covering the broadest baseline.

Technical Observability

Logging, metrics, and traces are table stakes, but proactive technical observability goes beyond them.

You're looking to answer:

What do our normal patterns look like?
When do things begin to degrade?
Can we detect and act on issues before customers feel them?
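
To make the first two questions concrete, here is a minimal sketch of how a team might define "normal" and flag degradation against it. It assumes a simple mean-plus-standard-deviation baseline over recent samples; the function name `is_degraded`, the `sigmas` parameter, and the sample latencies are all hypothetical, and real systems usually also need to account for seasonality.

```python
from statistics import mean, stdev


def is_degraded(baseline: list[float], current: float, sigmas: float = 3.0) -> bool:
    """Return True when `current` is more than `sigmas` standard deviations
    above the mean of the baseline samples."""
    if len(baseline) < 2:
        # Not enough history to say what "normal" looks like.
        return False
    threshold = mean(baseline) + sigmas * stdev(baseline)
    return current > threshold


# Hypothetical p95 latencies (ms) from a healthy period.
normal_latencies = [100.0, 102.0, 98.0, 101.0, 99.0]

print(is_degraded(normal_latencies, 103.0))  # within the normal band
print(is_degraded(normal_latencies, 180.0))  # well outside the normal band
```

The point is less the statistics and more the habit: the baseline is written down and checked automatically, rather than living in someone's head.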

Key Areas to Cover:

Area	What to Look For	Example
Utilization Metrics	CPU/memory/disk/db/redis/queue throughput	Spike in Redis latency during cache invalidation — causes slow checkouts
Anomaly Detection	Threshold-based alerts on errors, latency, etc.	Surge in 5xxs after a deploy triggers rollback via Datadog monitor
Inbound & Outbound Traffic	Monitor upstream/downstream health	A downstream auth service outage surfaces via failed outbound retries
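
As an illustration of the anomaly detection row, the check behind a "surge in 5xxs after a deploy" alert can be sketched as below. This is not Datadog's API or any vendor's monitor definition, just the underlying logic under assumed values: the names `error_rate` and `should_roll_back`, the 5% threshold, and the minimum request count are all hypothetical.

```python
def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses that are server errors (5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_roll_back(
    status_codes: list[int],
    max_error_rate: float = 0.05,  # assumed threshold, tune per service
    min_requests: int = 50,        # avoid deciding on a handful of requests
) -> bool:
    """Decide whether the post-deploy window looks bad enough to roll back."""
    if len(status_codes) < min_requests:
        return False
    return error_rate(status_codes) > max_error_rate


# Hypothetical post-deploy window: 90 successes and 10 server errors.
window = [200] * 90 + [503] * 10
print(should_roll_back(window))
```

The `min_requests` guard is a deliberate choice: a threshold on a rate is only meaningful once there is enough traffic in the window to trust the rate.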

Product Observability

This is about understanding how users interact with your features — not just if something broke, but how it's being used, where friction is, and how it contributes to outcomes.

Principles:

No database deep dives: Product teams should ideally never have to ask engineers to find out what happened.
Track usage from day one: Feature flags, funnels, and domain-specific events tell the real story.
Don’t forget unhappy paths: Failures, retries, and drop-offs are often heavily overlooked in my experience.
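
The principles above could look something like this in code: a small sketch of domain event tracking where the unhappy paths are first-class outcomes rather than an afterthought. `ProductEvent`, `EventLog`, the outcome labels, and the event names are hypothetical, and the in-memory list stands in for whatever analytics pipeline a team actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class ProductEvent:
    name: str
    user_id: str
    outcome: str = "success"  # e.g. "success", "failure", "retry", "abandoned"
    properties: dict[str, Any] = field(default_factory=dict)
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class EventLog:
    """In-memory stand-in for an analytics sink (an assumption for this sketch)."""

    def __init__(self) -> None:
        self._events: list[ProductEvent] = []

    def track(self, name: str, user_id: str, outcome: str = "success", **properties: Any) -> ProductEvent:
        event = ProductEvent(name, user_id, outcome, dict(properties))
        self._events.append(event)
        return event

    def count(self, name: str, outcome: Optional[str] = None) -> int:
        """Count events by name, optionally filtered to a single outcome."""
        return sum(
            1 for e in self._events
            if e.name == name and (outcome is None or e.outcome == outcome)
        )


log = EventLog()
log.track("phone_verification", "u1")
log.track("phone_verification", "u2", outcome="failure", reason="sms_timeout")
log.track("phone_verification", "u2", outcome="retry")
log.track("phone_verification", "u3", outcome="abandoned")

print(log.count("phone_verification", outcome="failure"))
```

Because failures and retries are recorded with the same event name as successes, a product manager can ask "how often does this step fail?" without anyone querying the database.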

Key Areas to Cover:

Area	What to Look For	Example
Feature Adoption	Who's using the new dashboard? How often?	Admin dashboard adoption is <5% of users — maybe we overbuilt it
Drop-off Points	Where do people abandon a flow?	40% of users never complete onboarding — logs show phone verification failure
Operational Visibility	What internal tools help debug user issues?	Admin portal shows "last known state" of a user subscription and events leading to it
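
For the drop-off row, here is a rough sketch of how a funnel report can be computed from per-step completion data. The function name `funnel_report`, the step names, and the user IDs are hypothetical, and the sketch assumes a strict funnel where a user only counts for a step if they completed every earlier step.

```python
def funnel_report(
    steps: list[str],
    completions: dict[str, set[str]],
) -> list[tuple[str, int, float]]:
    """For each step, return (step, users_reached, drop_off_from_previous_step).

    `completions` maps a step name to the set of user IDs that completed it.
    """
    report = []
    remaining = None  # users who completed every step so far
    for step in steps:
        users = completions.get(step, set())
        reached = users if remaining is None else remaining & users
        if not remaining:
            # First step, or nobody left in the funnel: no drop-off to report.
            drop_off = 0.0
        else:
            drop_off = 1 - len(reached) / len(remaining)
        report.append((step, len(reached), drop_off))
        remaining = reached
    return report


steps = ["signup", "phone_verification", "onboarding_complete"]
completions = {
    "signup": {"u1", "u2", "u3", "u4", "u5"},
    "phone_verification": {"u1", "u2", "u3"},
    "onboarding_complete": {"u1", "u2", "u3"},
}

for step, reached, drop_off in funnel_report(steps, completions):
    print(f"{step}: {reached} users, {drop_off:.0%} drop-off")
```

In this made-up data, the loss is concentrated at phone verification, which is the kind of signal that tells you where to look before anyone opens the logs.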

🧑‍🚀 Operational Readiness & Resilience III: Observability

Luke Curtis
