
Through the lens of an Engineering Leader
Most engineers think of observability as logs, dashboards, and alerts, and these are an important part of the puzzle. But as a leader, your role is to ensure your teams aren't just reacting to problems with these insights; more importantly, they're building systems that are transparent, measurable, and reliable from day one.
Whether it's catching performance issues before they trigger customer complaints, or giving product managers insight into how features are used (or not used), observability is your frontline defense against surprises in production. It is also how you verify whether there actually is an issue.
This topic overlaps with my CAPS strategy post, so I would advise reading the two together to ensure you're covering the broadest baseline.
Technical Observability
Logging, metrics, and traces are table stakes, but proactive technical observability goes beyond them.
You're looking to answer:
- What do our normal patterns look like?
- When do things begin to degrade?
- Can we detect and act on issues before customers feel them?
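One way to make these questions concrete is to compare a current reading against a baseline built from normal traffic. The sketch below is a minimal illustration, not a production detector: the function name `is_degraded`, the z-score threshold of 3, and the sample latencies are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_degraded(baseline, current, z_threshold=3.0):
    """Return True when `current` sits more than `z_threshold` standard
    deviations above the mean of the `baseline` samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any increase counts as a deviation.
        return current > mu
    return (current - mu) / sigma > z_threshold


# Hypothetical p95 latency samples (ms) collected during a normal week.
normal_p95_ms = [120, 118, 125, 130, 122, 119, 127]

print(is_degraded(normal_p95_ms, 128))  # within normal variation
print(is_degraded(normal_p95_ms, 210))  # well outside the baseline
```

The baseline answers "what does normal look like", and the threshold is where you decide degradation begins. Tuning that threshold is a judgment call per service.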
Key Areas to Cover:
| Area | What to Look For | Example |
|---|---|---|
| Utilization Metrics | CPU/memory/disk/DB/Redis/queue throughput | Spike in Redis latency during cache invalidation causes slow checkouts |
| Anomaly Detection | Threshold-based alerts on errors, latency, etc. | Surge in 5xxs after a deploy triggers rollback via Datadog monitor |
| Inbound & Outbound Traffic | Monitor upstream/downstream health | A downstream auth service outage surfaces via failed outbound retries |
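The anomaly detection row can be sketched as a plain threshold check on the 5xx rate in a post-deploy window. This is a hypothetical illustration of the decision logic only. It does not use Datadog's API, and the function names, the 5% threshold, and the minimum sample size are assumptions.

```python
def error_rate_5xx(status_codes):
    """Fraction of responses in the window that have a 5xx status."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_roll_back(status_codes, max_error_rate=0.05, min_requests=20):
    """Recommend a rollback when the post-deploy 5xx rate exceeds the
    threshold. `min_requests` avoids deciding on a tiny sample."""
    if len(status_codes) < min_requests:
        return False
    return error_rate_5xx(status_codes) > max_error_rate


# Hypothetical window of responses observed right after a deploy.
post_deploy_window = [200] * 90 + [502] * 10

print(error_rate_5xx(post_deploy_window))
print(should_roll_back(post_deploy_window))
```

In practice the monitoring tool evaluates the threshold for you. The value of writing it out is agreeing as a team on the numbers: what error rate, over how many requests, justifies an automatic rollback.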
Product Observability
This is about understanding how users interact with your features: not just whether something broke, but how it's being used, where the friction is, and how it contributes to outcomes.
Principles:
- No database deep dives: Product teams should ideally never have to ask engineers to find out what happened.
- Track usage from day one: Feature flags, funnels, and domain-specific events tell the real story.
- Don't forget unhappy paths: Failures, retries, and drop-offs are often heavily overlooked in my experience.
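As a rough sketch of these principles, a domain event can carry its outcome explicitly so that unhappy paths are recorded rather than inferred from missing data. The event shape, the `build_event` helper, and the outcome names below are assumptions for illustration, not a specific analytics vendor's schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical set of outcomes; adjust to your own domain.
ALLOWED_OUTCOMES = {"success", "failure", "retry", "abandoned"}


def build_event(name, user_id, outcome, **properties):
    """Build a structured product event with an explicit outcome field."""
    if outcome not in ALLOWED_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    return {
        "event": name,
        "user_id": user_id,
        "outcome": outcome,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "properties": properties,
    }


# An unhappy-path event: the second SMS verification attempt timed out.
event = build_event(
    "phone_verification", "user-123", "failure",
    reason="sms_timeout", attempt=2,
)
print(json.dumps(event, indent=2))
```

Once events like this are emitted from day one, product teams can answer "what happened to this user?" from the event stream instead of asking an engineer to query the database.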
Key Areas to Cover:
| Area | What to Look For | Example |
|---|---|---|
| Feature Adoption | Who's using the new dashboard? How often? | Admin dashboard adoption is <5% of users; maybe we overbuilt it |
| Drop-off Points | Where do people abandon a flow? | 40% of users never complete onboarding; logs show phone verification failure |
| Operational Visibility | What internal tools help debug user issues? | Admin portal shows "last known state" of a user subscription and events leading to it |
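The drop-off row lends itself to a small worked example. Given how many users reach each step of a flow, the fraction lost between consecutive steps shows where to look first. The step names and counts below are invented for illustration. They were chosen so that 40% of users never finish, matching the example in the table.

```python
def funnel_drop_off(step_counts):
    """Given ordered (step_name, users_reaching_step) pairs, return the
    fraction of users lost between each step and the next."""
    drop_offs = []
    for (step, count), (next_step, next_count) in zip(step_counts, step_counts[1:]):
        lost = 0.0 if count == 0 else (count - next_count) / count
        drop_offs.append((f"{step} -> {next_step}", round(lost, 3)))
    return drop_offs


# Hypothetical onboarding funnel: 600 of 1000 users complete it.
onboarding = [
    ("signup", 1000),
    ("phone_verification", 900),
    ("profile_setup", 620),
    ("onboarding_complete", 600),
]

drop_offs = funnel_drop_off(onboarding)
for transition, lost in drop_offs:
    print(f"{transition}: {lost:.1%} lost")

# The transition with the largest loss is the first place to investigate.
worst = max(drop_offs, key=lambda pair: pair[1])
print("largest drop-off:", worst[0])
```

With these numbers the largest loss is between phone verification and profile setup, which is the cue to go and read the verification failure logs.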
Luke Curtis
Engineering Leader with over 10 years of experience in building and leading high-performing teams. Passionate about transforming organizations through technical excellence and empowered engineering cultures.