Definition
Inference is the phase where a trained AI model processes new inputs to produce outputs: predictions, classifications, generated text, images, or any other model output. When a user asks ChatGPT a question, OpenAI's servers run inference on a GPT model to generate a response. When your product calls the Claude API, inference happens on Anthropic's infrastructure. The term distinguishes the "using" phase from the "building" (training) phase of the model lifecycle.
Inference performance is characterized by three metrics: latency (time to generate a response), throughput (requests processed per second), and cost (dollars per token or per request). These three factors are in constant tension. Lower latency usually means more expensive hardware. Higher throughput often requires batching, which increases latency. Optimizing all three simultaneously is a core infrastructure challenge for AI product teams.
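The batching tension can be made concrete with a back-of-envelope model: a larger batch amortizes one forward pass across more requests (higher throughput), but each request waits for the batch to assemble (higher latency). All numbers below are illustrative assumptions, not benchmarks.

```python
# Toy model of the throughput/latency tension from batching.
# per_pass_seconds and wait_per_slot are assumed values for illustration.

def batch_tradeoff(batch_size: int,
                   per_pass_seconds: float = 0.5,
                   wait_per_slot: float = 0.05) -> tuple[float, float]:
    """Return (requests_per_second, per_request_latency_seconds)."""
    # Each request waits for the batch to fill, then shares one forward pass.
    latency = per_pass_seconds + wait_per_slot * (batch_size - 1)
    throughput = batch_size / per_pass_seconds
    return throughput, latency

for b in (1, 8, 32):
    tput, lat = batch_tradeoff(b)
    print(f"batch={b:2d}  throughput={tput:6.1f} req/s  latency={lat:.2f} s")
```

With these assumed constants, going from batch size 1 to 32 multiplies throughput by 32 while roughly quadrupling per-request latency, which is exactly the trade-off inference teams tune.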
For products using third-party APIs (OpenAI, Anthropic, Google), inference costs show up directly on your bill. For self-hosted models, inference costs appear as GPU compute bills from cloud providers. Either way, inference is typically the largest line item in an AI product's cost structure, often exceeding all other infrastructure costs combined. You can estimate these costs for your use case with the LLM Cost Estimator.
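A per-request cost estimate is simple arithmetic once you know token counts and rates. The sketch below uses hypothetical model names and placeholder prices; substitute your provider's current per-million-token rates.

```python
# Back-of-envelope per-request inference cost.
# Model names and prices are assumed placeholders, not real rates.

PRICE_PER_MTOK = {  # (input, output) USD per million tokens
    "small-model": (0.25, 1.25),
    "large-model": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the assumed rates above."""
    p_in, p_out = PRICE_PER_MTOK[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# A chat turn with a 1,500-token prompt and a 400-token reply:
cost = request_cost("large-model", 1_500, 400)
print(f"${cost:.4f} per request")  # multiply by requests/user/month for a monthly figure
```

Multiplying the per-request figure by expected requests per user per month gives the cost-per-user number that feeds into pricing decisions.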
Why It Matters for Product Managers
Inference economics reshape how PMs think about feature design. In traditional SaaS, serving one more user costs almost nothing. In AI-powered products, every user request has a measurable cost. This changes pricing, packaging, and even UX design. Features that trigger fewer or shorter inference calls are cheaper to operate. Streaming responses (showing tokens as they are generated) reduce perceived latency without changing actual cost.
PMs need to track inference metrics alongside product metrics. If your p95 inference latency is 3 seconds but users expect instant responses, you have a UX problem that no amount of prompt optimization can fix. If inference costs per user exceed $5/month on a $20/month plan, your contribution margin is under pressure. These are product decisions, not just engineering concerns.
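The margin pressure in that example is easy to quantify with the paragraph's own numbers:

```python
# Worked margin check using the figures above:
# $5/month inference cost per user on a $20/month plan.

plan_price = 20.00       # monthly subscription price (USD)
inference_cost = 5.00    # monthly inference cost per user (USD)

gross_margin = (plan_price - inference_cost) / plan_price
print(f"Margin before any other cost: {gross_margin:.0%}")  # 75%

# A power user who doubles their usage doubles inference cost too:
heavy_margin = (plan_price - 2 * inference_cost) / plan_price
print(f"Margin for a 2x-usage user: {heavy_margin:.0%}")  # 50%
```

Unlike traditional SaaS, that margin erodes linearly with usage, which is why usage caps or metered tiers often accompany flat-rate AI plans.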
How to Apply It
Build inference awareness into your product planning process. Every AI feature should have an estimated inference cost and latency target before development begins.
Steps for PMs managing AI products:
- ☐ Estimate per-request inference cost for each AI feature (model + tokens + frequency)
- ☐ Set latency SLAs for user-facing inference (e.g., < 2 seconds for chat, < 500ms for autocomplete)
- ☐ Implement model routing: use cheaper models for simple tasks, premium models for complex ones
- ☐ Build caching layers for common queries to reduce redundant inference
- ☐ Monitor inference cost per user and per feature to catch cost drift early
- ☐ Factor inference costs into your pricing model, especially for usage-based tiers
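Two of the checklist items, model routing and caching, can be sketched together. The model names, the `classify_complexity` heuristic, and the placeholder response are all hypothetical; swap in your own routing logic and API client.

```python
# Sketch: exact-match cache in front of inference, plus a router that sends
# simple prompts to a cheaper model. All names here are illustrative.

from functools import lru_cache

CHEAP_MODEL = "small-model"      # assumed name for a low-cost model
PREMIUM_MODEL = "large-model"    # assumed name for a high-capability model

def classify_complexity(prompt: str) -> str:
    """Toy heuristic: long or multi-step prompts go to the premium model."""
    if len(prompt) > 500 or "step by step" in prompt.lower():
        return PREMIUM_MODEL
    return CHEAP_MODEL

@lru_cache(maxsize=10_000)  # dedupe identical prompts; repeats skip inference
def run_inference(prompt: str) -> str:
    model = classify_complexity(prompt)
    # Placeholder for a real API call, e.g. client.generate(model, prompt).
    return f"[{model}] response to: {prompt[:40]}"

print(run_inference("What's our refund policy?"))  # routed to the cheap model
print(run_inference("What's our refund policy?"))  # identical prompt: served from cache
```

In production, the cache would typically be a shared store keyed on normalized prompts rather than an in-process `lru_cache`, and the router would use a lightweight classifier or request metadata instead of string length.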