Engineering

SLA, SLO, and SLI

Definition

SLA, SLO, and SLI form a three-layer reliability framework that connects engineering metrics to business commitments. SLI (Service Level Indicator) is the measurement: a specific metric like request latency, error rate, or availability. SLO (Service Level Objective) is the internal target: the team agrees that the SLI should meet a specific threshold (e.g., "99.9% of requests complete within 200ms"). SLA (Service Level Agreement) is the external commitment: a contractual promise to customers that the service will meet certain performance standards, often with financial penalties for breach.

Google's Site Reliability Engineering (SRE) team formalized this framework in their 2016 book "Site Reliability Engineering." The hierarchy works bottom-up: SLIs produce data, SLOs interpret that data as pass/fail against targets, and SLAs codify customer-facing promises. The critical insight is the error budget concept. If your SLO is 99.9% availability over 30 days, you have a budget of approximately 43 minutes of downtime. Spending that budget on feature deployment risk is a conscious choice. Exceeding it triggers a reliability-focused response.
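The error budget arithmetic above can be sketched in a few lines. This is an illustrative calculation, not code from the SRE book; the function name is invented for this example.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed downtime for an availability SLO over a window.

    A 99.9% SLO leaves 0.1% of the window as budget.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# 99.9% over a 30-day window -> roughly 43 minutes of downtime
print(round(error_budget_minutes(0.999, 30), 1))  # 43.2
```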

The existing SLA glossary entry covers the contractual dimension. This entry focuses on how the three concepts work together as a system. In practice, most engineering teams interact primarily with SLOs and SLIs. The SLA is a business document that the legal and sales teams negotiate, based on what the engineering team says they can reliably deliver.

Why It Matters for Product Managers

This framework gives PMs a structured way to negotiate the tension between shipping features and maintaining reliability. Without SLOs, the conversation is subjective: "the site feels slow" versus "it is fine for most users." With SLOs, the conversation is data-driven: "we are at 99.85% availability 20 days into the month against a 99.9% target, so we have already spent the full 30-day error budget with 10 days remaining."

The error budget model is especially powerful. When the error budget is healthy, the PM can push for faster shipping, riskier experiments, and larger deployments. When the error budget is nearly exhausted, the PM should support the engineering team's request to pause features and invest in technical debt reduction, better testing, or infrastructure improvements. This is not a subjective argument about priorities. It is a measurable policy.
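A simple way to operationalize this policy is to track the fraction of the budget consumed so far in the window. The sketch below is illustrative (the function name and the freeze threshold are assumptions, not from the source):

```python
def budget_consumed(observed_availability: float, slo: float,
                    elapsed_days: float, window_days: float) -> float:
    """Fraction of the window's error budget spent so far.

    Compares downtime accrued in the elapsed portion of the window
    against the budget allotted for the whole window.
    """
    budget_min = (1 - slo) * window_days * 24 * 60
    used_min = (1 - observed_availability) * elapsed_days * 24 * 60
    return used_min / budget_min

# 99.85% observed 20 days into a 30-day window with a 99.9% SLO
consumed = budget_consumed(0.9985, 0.999, elapsed_days=20, window_days=30)
print(f"{consumed:.0%} of budget used")  # 100% of budget used

# Hypothetical policy: freeze risky deploys past an agreed threshold
if consumed >= 1.0:
    print("Budget exhausted: prioritize reliability work")
```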

How to Apply It

Start by identifying the SLIs that matter for your product. For a web application, availability and latency percentiles are typical starting points. For an API product, add error rate and throughput. Then set SLOs that are achievable but ambitious. A 99.9% availability SLO means roughly 8.7 hours of downtime per year. A 99.99% SLO means roughly 52 minutes per year. The right target depends on customer expectations, competitive norms, and your team's DevOps maturity. Track SLO performance on a dashboard visible to both engineering and product, and use the error budget as the decision framework for balancing feature work versus reliability investment. Review your team's approach alongside the North Star Framework to ensure reliability goals align with your product's core success metrics.
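The downtime figures quoted above follow directly from the target: each extra "nine" cuts the allowance by a factor of ten. A quick illustrative conversion (function name is invented for this sketch):

```python
def downtime_per_year(slo: float) -> float:
    """Allowed downtime in minutes over a 365-day year for a given SLO."""
    return (1 - slo) * 365 * 24 * 60

for slo in (0.999, 0.9995, 0.9999):
    hours = downtime_per_year(slo) / 60
    print(f"{slo:.2%} availability -> {hours:.2f} hours/year")
# 99.90% -> ~8.76 hours/year; 99.99% -> ~0.88 hours (~53 minutes)/year
```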

Frequently Asked Questions

What is the difference between SLO and SLA?
An SLO (Service Level Objective) is an internal target your team sets for service reliability, for example 'the API will respond within 200ms for 99.9% of requests.' An SLA (Service Level Agreement) is a contractual commitment to customers with financial consequences for breach, for example 'we guarantee 99.95% uptime or we credit 10% of your monthly bill.' SLOs are always stricter than SLAs because you need a buffer. If your SLA promises 99.95% uptime, your SLO should target 99.99% so you have room to catch issues before they breach the customer commitment.
What are good SLI metrics to track?
The four most common SLIs are availability (percentage of successful requests), latency (response time at various percentiles like p50, p95, p99), throughput (requests per second the system handles), and error rate (percentage of requests that return errors). Choose SLIs that reflect what users actually experience. A backend service might be 'available' (returning 200 status codes) but functionally broken if latency is 30 seconds. Track latency percentiles, not just averages, because averages hide the experience of your worst-affected users.
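The point about percentiles versus averages can be made concrete with a small sketch. This is illustrative only: the nearest-rank percentile function and the sample data are invented for the example, and production systems typically compute percentiles from histograms rather than raw samples.

```python
def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile of a pre-sorted list of samples."""
    idx = max(0, round(p / 100 * len(sorted_values)) - 1)
    return sorted_values[idx]

# Mostly-fast latencies (ms) with a slow tail: the mean looks fine,
# but p99 exposes the worst-affected users.
latencies = sorted([50] * 95 + [2000] * 5)
mean = sum(latencies) / len(latencies)
print(f"mean: {mean:.0f} ms")                      # ~148 ms
print(f"p50:  {percentile(latencies, 50):.0f} ms")  # 50 ms
print(f"p99:  {percentile(latencies, 99):.0f} ms")  # 2000 ms
```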
How do SLOs affect product decisions?
SLOs create an error budget, the allowed amount of unreliability before the team must prioritize stability over features. If your SLO is 99.9% availability (allowing 43 minutes of downtime per month) and you have consumed 40 minutes by day 20, the team should freeze feature deployments and focus on reliability. This gives PMs a data-driven framework for balancing feature velocity against stability. Google's SRE team pioneered this approach, and it is now standard practice at companies that take reliability seriously.
