Definition
SLA, SLO, and SLI form a three-layer reliability framework that connects engineering metrics to business commitments. SLI (Service Level Indicator) is the measurement: a specific metric like request latency, error rate, or availability. SLO (Service Level Objective) is the internal target: the team agrees that the SLI should meet a specific threshold (e.g., "99.9% of requests complete within 200ms"). SLA (Service Level Agreement) is the external commitment: a contractual promise to customers that the service will meet certain performance standards, often with financial penalties for breach.
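To make the SLI and SLO layers concrete, here is a minimal Python sketch of a latency SLI checked against the example objective above. The `Request` record, the function names, and the sample traffic are hypothetical, chosen only for illustration; a real system would pull these figures from its monitoring stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool  # True if the request succeeded


def latency_sli(requests, threshold_ms=200.0):
    """SLI: fraction of requests that succeeded within threshold_ms."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests)


def meets_slo(sli, target=0.999):
    """SLO: pass/fail check of the measured SLI against the target."""
    return sli >= target


# Illustrative traffic: 10,000 requests, 8 of them slower than 200 ms.
sample = [Request(120.0, True)] * 9992 + [Request(450.0, True)] * 8
sli = latency_sli(sample)
print(f"SLI = {sli:.4%}, SLO met: {meets_slo(sli)}")
```

The SLA does not appear in the sketch because it is a contract rather than a measurement; it would normally sit on top of a target like the one passed to `meets_slo`.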
Google's Site Reliability Engineering (SRE) team formalized this framework in their 2016 book "Site Reliability Engineering." The hierarchy works bottom-up: SLIs produce data, SLOs interpret that data as pass/fail against targets, and SLAs codify customer-facing promises. The critical insight is the error budget concept. If your SLO is 99.9% availability over 30 days, you have a budget of approximately 43 minutes of downtime (0.1% of the 43,200 minutes in the window). Spending that budget on feature deployment risk is a conscious choice. Exceeding it triggers a reliability-focused response.
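The error budget arithmetic can be sketched in a few lines of Python. This assumes time-based availability (downtime minutes divided by total minutes in the window); the function name and parameters are illustrative, not taken from any particular tool.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


budget = error_budget_minutes(0.999, window_days=30)
print(f"{budget:.1f} minutes")  # 43.2 minutes
```

Teams that define availability per request rather than per minute would apply the same formula to request counts instead of window minutes.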
The existing SLA glossary entry covers the contractual dimension. This entry focuses on how the three concepts work together as a system. In practice, most engineering teams interact primarily with SLOs and SLIs. The SLA is a business document that the legal and sales teams negotiate, based on what the engineering team says they can reliably deliver.
Why It Matters for Product Managers
This framework gives PMs a structured way to negotiate the tension between shipping features and maintaining reliability. Without SLOs, the conversation is subjective: "the site feels slow" versus "it is fine for most users." With SLOs, the conversation is data-driven: "we have had 28 minutes of downtime so far this month against a 99.9% target, so we have consumed 65% of our 43-minute error budget with 10 days remaining."
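A budget-consumed figure like the one in that statement can be derived from the downtime recorded so far. The sketch below assumes a 30-day window and time-based availability; the 28 minutes of downtime is an illustrative input, and `budget_consumed` is a hypothetical helper name.

```python
def budget_consumed(downtime_minutes, slo_target=0.999, window_days=30):
    """Fraction of the window's error budget already spent."""
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
    return downtime_minutes / budget_minutes


consumed = budget_consumed(28.0)
print(f"{consumed:.0%} of the error budget consumed")
```

A value above 1.0 means the budget for the window is already overspent.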
The error budget model is especially powerful. When the error budget is healthy, the PM can push for faster shipping, riskier experiments, and larger deployments. When the error budget is nearly exhausted, the PM should support the engineering team's request to pause features and invest in technical debt reduction, better testing, or infrastructure improvements. This is not a subjective argument about priorities. It is a measurable policy.
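One way to turn this into a written policy is a simple mapping from budget consumption to a planning posture. The sketch below is only an illustration: the 90% "nearly exhausted" threshold and the wording of each posture are assumptions that each team would set for itself.

```python
def release_posture(consumed, freeze_threshold=0.9):
    """Map error-budget consumption (0.0 = untouched, 1.0 = fully spent)
    to a planning posture. freeze_threshold is an assumed, team-chosen value."""
    if consumed >= 1.0:
        return "exceeded: reliability-focused response"
    if consumed >= freeze_threshold:
        return "nearly exhausted: pause features, invest in reliability"
    return "healthy: continue shipping features"


for level in (0.40, 0.95, 1.20):
    print(f"{level:.0%} consumed -> {release_posture(level)}")
```

Agreeing on the threshold in advance is what keeps the later conversation from becoming a subjective argument.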
How to Apply It
Start by identifying the SLIs that matter for your product. For a web application, availability and latency percentiles are typical starting points. For an API product, add error rate and throughput. Then set SLOs that are achievable but ambitious. A 99.9% availability SLO means roughly 8.76 hours of downtime per year. A 99.99% SLO means roughly 52.6 minutes per year. The right target depends on customer expectations, competitive norms, and your team's DevOps maturity. Track SLO performance on a dashboard visible to both engineering and product, and use the error budget as the decision framework for balancing feature work versus reliability investment. Review your team's approach alongside the North Star Framework to ensure reliability goals align with your product's core success metrics.
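When comparing candidate targets, it can help to translate each one into allowed downtime per year. A minimal sketch, assuming a 365-day year and time-based availability; the names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def annual_downtime_hours(slo_target):
    """Allowed downtime per year, in hours, for an availability target."""
    return (1.0 - slo_target) * HOURS_PER_YEAR


for target in (0.999, 0.9999):
    hours = annual_downtime_hours(target)
    print(f"{target:.2%}: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Each additional nine cuts the allowance by a factor of ten, which is why the choice of target should follow customer expectations rather than default to the strictest option.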