
What
Software quality isn’t just about shipping features fast; it’s about building trust through reliability. The CAPS framework (Correctness, Availability, Performance, Security) helps engineering leaders define and measure technical health in a way that aligns with business outcomes.
Much of this may feel like common sense, but CAPS gives teams the data and structure to uphold those best practices consistently.
A solid CAPS strategy helps build trust and transparency with stakeholders.
The stages below describe how to bring each area of the CAPS strategy into a maintainable state. The Six Sigma DMAIC model is a good lens for this:
Define - What does success look like for this particular area of the strategy? What should we be looking at?
Measure - How are we measuring it? What infrastructure do we need to support measuring this?
Analyse - We now have the data we need. What common themes are we seeing? What have we learnt since we defined our metrics?
Improve - Building on the results of the previous step, we can iterate and make tweaks to how we measure.
Control - Finally, we’re in a position to set standards for what the boundaries of a successful CAPS strategy look like for the team.
How
Each area of CAPS requires an in-depth look at what it means for the business. As a general rule of thumb, though, the themes below have helped me define what that looks like in a technical context.
| Strategy | Themes |
|---|---|
| Correctness | Error rates, failed business logic paths, and regressions in core user flows |
| Availability | SLOs, SLAs and SLIs (e.g. uptime monitors, synthetic tests) |
| Performance | P75 and P95 latency, throughput, load testing, outages |
| Security | CVEs, incident management and resolution strategy, auditability, security reviews, penetration testing |
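As a rough illustration of the Performance row, the sketch below computes P75 and P95 from a list of request latencies using the nearest-rank method. The function name `percentile` and the sample latencies are made up for this example. In practice an observability tool would usually calculate these for you, possibly with a different interpolation method.

```python
def percentile(samples, p):
    """Return the p-th percentile (integer 1-100) using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Nearest rank: the smallest rank such that at least p% of the
    # samples fall at or below it (integer ceiling of p * n / 100).
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 10, 20, ..., 200.
latencies_ms = [10 * i for i in range(1, 21)]

p75 = percentile(latencies_ms, 75)
p95 = percentile(latencies_ms, 95)
print(f"P75 = {p75} ms, P95 = {p95} ms")  # prints "P75 = 150 ms, P95 = 190 ms"
```

Tracking both values is useful because P75 shows what most users experience, while P95 surfaces the slow tail that an average would hide.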
CAPS isn’t just a checklist; it’s a mindset. By investing in these four pillars early, engineering teams can move faster with confidence, reduce firefighting, and build systems that scale.
Availability Bonus Content
SLOs, SLAs and SLIs
Thinking a little deeper about Availability has given me really solid insights into the overall health of a team's engineering efforts, so I've extended this post slightly to go deeper on SLOs, SLAs and SLIs.
Service Level Objectives (SLOs), Service Level Agreements (SLAs) and Service Level Indicators (SLIs) come together to describe one wider theme: availability. By using the nuances of each of these metrics, a team (and wider observers) can build a fine-tuned understanding of the health of the systems they operate.
There is already plenty of documentation on what these could look like for a business and how to get there. One such resource is SLODLC, which I find a great starting point for getting “sensible defaults” and tight feedback loops when defining what success looks like through this lens.
A quick TL;DR of each, and the nuances of the three definitions:
SLI - The measurable indicator (e.g. order processing latency) - A measurement of one part of your system that feeds into the SLO. For example, if you owned an e-commerce store: how long does it take to process an order once payment has been made?
SLO - The internal target (e.g. 99% of orders processed < 1 minute) - An objective (usually a %) that the whole team agrees to hit consistently for uptime or delivery of a feature. Taking the e-commerce store one step further, the objective might be that 99% of orders are processed within 1 minute. Any order that takes longer counts against the SLO: if 99 orders are processed in under 1 minute but one order takes 2, the service is running at 99% for the period.
SLA - The external commitment, often contractual (e.g. 97% uptime guarantee) - Using the SLOs, you make an agreement with key stakeholders to guarantee a certain level of service. Breaches usually result in monetary fines, discounts or, in the worst cases, grievance procedures. An example would be a contract stating that 97% of all orders for your e-commerce store will be processed in under a minute.
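To show how the three fit together, here is a minimal sketch of the e-commerce example. It measures the SLI from a batch of order processing times, then checks the result against the 99% SLO and the 97% SLA. The function name, constants and order durations are all hypothetical; a real setup would pull these numbers from your monitoring tooling.

```python
def compliance_percent(durations_s, threshold_s=60.0):
    """SLI: percentage of orders processed in under threshold_s seconds."""
    if not durations_s:
        raise ValueError("need at least one order")
    good = sum(1 for d in durations_s if d < threshold_s)
    return 100.0 * good / len(durations_s)


SLO_TARGET = 99.0  # internal target from the example above
SLA_TARGET = 97.0  # external commitment from the example above

# Hypothetical period: 99 orders took 30 seconds, one took 2 minutes.
durations_s = [30.0] * 99 + [120.0]

measured = compliance_percent(durations_s)
slo_met = measured >= SLO_TARGET
sla_met = measured >= SLA_TARGET
print(f"SLI: {measured:.1f}% | SLO met: {slo_met} | SLA met: {sla_met}")
# prints "SLI: 99.0% | SLO met: True | SLA met: True"
```

If a second order in the same period were slow, the measurement would drop to 98%: the internal SLO would be missed while the external SLA is still met. That gap between the two targets is what gives a team room to react before a contractual breach.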
What does implementation look like?
There’s no one-size-fits-all implementation. Tools like Datadog, New Relic, or custom dashboards can support CAPS strategies, but the real value comes from teams engaging with the data and adjusting as they grow.
Luke Curtis
Engineering Leader with over 10 years of experience in building and leading high-performing teams. Passionate about transforming organizations through technical excellence and empowered engineering cultures.