June 17, 2025 5 min read

๐Ÿง‘โ€๐Ÿš€ Operational Readiness & Resilience III: Observability

Luke Curtis

Engineering Leader

Through the lens of an Engineering Leader

Most engineers think of observability as logs, dashboards, and alerts, and these are an important part of the puzzle. But as a leader, your role is to ensure your teams aren't just using these insights to react to problems; they're building systems that are transparent, measurable, and reliable from day one.

Whether it's catching performance issues before they trigger customer complaints or giving product managers insight into how features are used (or not used), observability is your frontline defense against surprises in production. It is also how you verify whether a reported issue is real.

This topic overlaps with my CAPS strategy post, so I would advise reading the two together to make sure you're covering the broadest baseline.

Technical Observability

Logging, metrics, and traces are table stakes, but proactive technical observability goes beyond them.

You're looking to answer:

Key Areas to Cover:

| Area | What to Look For | Example |
| --- | --- | --- |
| Utilization Metrics | CPU/memory/disk/db/redis/queue throughput | Spike in Redis latency during cache invalidation causes slow checkouts |
| Anomaly Detection | Threshold-based alerts on errors, latency, etc. | Surge in 5xxs after a deploy triggers rollback via Datadog monitor |
| Inbound & Outbound Traffic | Monitor upstream/downstream health | A downstream auth service outage surfaces via failed outbound retries |
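
To make the anomaly detection row concrete, here is a minimal Python sketch of a threshold-based check on 5xx responses over a sliding window. The `ErrorRateMonitor` class, the window size, and the 5% threshold are all assumptions for illustration. This is not how a Datadog monitor is configured, and in practice the rule would usually live in your monitoring tool rather than in application code.

```python
from collections import deque


class ErrorRateMonitor:
    """Sliding-window 5xx error-rate check (hypothetical helper, not a vendor API)."""

    def __init__(self, window_size=100, threshold=0.05):
        # window_size: how many of the most recent requests to evaluate (assumed value)
        # threshold: fraction of 5xx responses that should trigger an alert (assumed value)
        self.window_size = window_size
        self.threshold = threshold
        # deque with maxlen silently discards the oldest entry once the window is full
        self._statuses = deque(maxlen=window_size)

    def record(self, status_code):
        """Record the HTTP status code of one completed request."""
        self._statuses.append(status_code)

    def error_rate(self):
        """Fraction of requests in the current window that returned a 5xx status."""
        if not self._statuses:
            return 0.0
        errors = sum(1 for status in self._statuses if 500 <= status <= 599)
        return errors / len(self._statuses)

    def should_alert(self):
        """Alert only when the window is full, so a few early requests cannot trip it."""
        window_full = len(self._statuses) == self.window_size
        return window_full and self.error_rate() > self.threshold
```

You would call `record()` for each response and check `should_alert()` on a schedule or after each request. Waiting for a full window trades a little detection speed for fewer false alarms right after a deploy; whether that trade is right depends on your traffic volume.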

Product Observability

This is about understanding how users interact with your features: not just whether something broke, but how it's being used, where the friction is, and how it contributes to outcomes.

Principles:

Key Areas to Cover:

| Area | What to Look For | Example |
| --- | --- | --- |
| Feature Adoption | Who's using the new dashboard? How often? | Admin dashboard adoption is <5% of users; maybe we overbuilt it |
| Drop-off Points | Where do people abandon a flow? | 40% of users never complete onboarding; logs show phone verification failure |
| Operational Visibility | What internal tools help debug user issues? | Admin portal shows "last known state" of a user subscription and events leading to it |
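
The drop-off row can be sketched in a few lines of Python. The example below assumes you already emit one event per user per completed step; the step names in `FUNNEL_STEPS` and the `funnel_report` function are hypothetical and only illustrate the calculation, not any particular analytics product.

```python
from collections import defaultdict

# Hypothetical onboarding steps, in the order a user is expected to complete them.
FUNNEL_STEPS = ["signup", "phone_verification", "profile_setup", "onboarding_complete"]


def funnel_report(events, steps=FUNNEL_STEPS):
    """Summarise how many users reach each step and the drop-off from the previous one.

    `events` is an iterable of (user_id, step_name) tuples.
    A user counts toward a step only if they also completed every earlier step.
    """
    users_by_step = defaultdict(set)
    for user_id, step in events:
        users_by_step[step].add(user_id)

    report = []
    reached = None          # users who have completed every step so far
    previous_count = None   # size of `reached` at the previous step
    for step in steps:
        step_users = users_by_step.get(step, set())
        reached = step_users if reached is None else reached & step_users
        count = len(reached)
        # The first step has nothing to compare against; an empty previous step
        # is also reported as 0.0 to avoid dividing by zero.
        drop_off = 1 - count / previous_count if previous_count else 0.0
        report.append({"step": step, "users": count, "drop_off": round(drop_off, 2)})
        previous_count = count
    return report
```

A report like this tells you where users leave, not why. Pairing the step with the largest drop-off against the logs for that step is what turns the number into something a team can act on.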

Luke Curtis

Engineering Leader with over 10 years of experience in building and leading high-performing teams. Passionate about transforming organizations through technical excellence and empowered engineering cultures.
