AI and Machine Learning

Inference

Definition

Inference is the phase where a trained AI model processes new inputs to produce outputs: predictions, classifications, generated text, images, or any other model output. When a user asks ChatGPT a question, the servers run inference on the GPT model to generate a response. When your product calls the Claude API, inference happens on Anthropic's infrastructure. The term distinguishes the "using" phase from the "building" (training) phase of the model lifecycle.

Inference performance is characterized by three metrics: latency (time to generate a response), throughput (requests processed per second), and cost (dollars per token or per request). These three factors are in constant tension. Lower latency usually means more expensive hardware. Higher throughput often requires batching, which increases latency. Optimizing all three simultaneously is a core infrastructure challenge for AI product teams.
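The batching tension can be made concrete with a back-of-the-envelope model. The fixed and per-item costs below are illustrative assumptions, not measured GPU numbers:

```python
def batch_stats(batch_size, fixed_ms=50.0, per_item_ms=10.0):
    """Toy model: each forward pass pays a fixed overhead plus a per-item cost."""
    latency_ms = fixed_ms + per_item_ms * batch_size   # every request in the batch waits this long
    throughput = batch_size / (latency_ms / 1000.0)    # requests completed per second
    return latency_ms, throughput

for size in (1, 8, 32):
    latency, tput = batch_stats(size)
    print(f"batch={size:2d}  latency={latency:5.0f} ms  throughput={tput:5.1f} req/s")
```

Larger batches push throughput up (the fixed overhead is amortized) while pushing per-request latency up too, which is exactly the trade-off teams tune.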

For products using third-party APIs (OpenAI, Anthropic, Google), inference costs show up directly on your bill. For self-hosted models, inference costs appear as GPU compute bills from cloud providers. Either way, inference is typically the largest line item in an AI product's cost structure, often exceeding all other infrastructure costs combined. You can estimate these costs for your use case with the LLM Cost Estimator.
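A rough per-request estimate only needs token counts and per-token prices. The model names and prices below are made-up placeholders; check your provider's current pricing page before relying on any numbers:

```python
# Hypothetical prices in dollars per 1M tokens -- NOT real provider rates.
PRICES = {
    "small-model": {"input": 0.25, "output": 1.25},
    "large-model": {"input": 3.00, "output": 15.00},
}

def request_cost(model, input_tokens, output_tokens):
    """Estimated dollar cost of one API call."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A chat feature: ~2,000 input tokens (prompt + context), ~500 output tokens
per_request = request_cost("large-model", 2_000, 500)
monthly = per_request * 30 * 100_000  # at 100k requests/day
print(f"${per_request:.4f} per request, ${monthly:,.0f}/month at 100k req/day")
```

Running the numbers this way before a feature ships is usually enough to decide whether it needs a cheaper model, shorter context, or a usage cap.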

Why It Matters for Product Managers

Inference economics reshape how PMs think about feature design. In traditional SaaS, serving one more user costs almost nothing. In AI-powered products, every user request has a measurable cost. This changes pricing, packaging, and even UX design. Features that trigger fewer or shorter inference calls are cheaper to operate. Streaming responses (showing tokens as they generate) improves perceived latency without changing actual cost.

PMs need to track inference metrics alongside product metrics. If your p95 inference latency is 3 seconds but users expect instant responses, you have a UX problem that no amount of prompt optimization can fix. If inference costs per user exceed $5/month on a $20/month plan, your contribution margin is under pressure. These are product decisions, not just engineering concerns.
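Tracking a latency target means looking at tail percentiles, not averages. A minimal nearest-rank p95 over logged request latencies shows why the mean can hide a UX problem:

```python
import math

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a list of latencies in milliseconds."""
    s = sorted(latencies_ms)
    idx = max(0, math.ceil(0.95 * len(s)) - 1)
    return s[idx]

# 90 fast requests and 10 slow ones: the mean looks acceptable, the tail does not
sample = [800] * 90 + [3000] * 10
print("mean:", sum(sample) / len(sample), "ms   p95:", p95(sample), "ms")
```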

How to Apply It

Build inference awareness into your product planning process. Every AI feature should have an estimated inference cost and latency target before development begins.

Steps for PMs managing AI products:

  • Estimate per-request inference cost for each AI feature (model + tokens + frequency)
  • Set latency SLAs for user-facing inference (e.g., < 2 seconds for chat, < 500ms for autocomplete)
  • Implement model routing: use cheaper models for simple tasks, premium models for complex ones
  • Build caching layers for common queries to reduce redundant inference
  • Monitor inference cost per user and per feature to catch cost drift early
  • Factor inference costs into your pricing model, especially for usage-based tiers
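The routing step above can be sketched as a simple rule. The complexity heuristic and model names here are placeholders; production routers often use a small classifier or richer prompt features:

```python
def route_model(prompt: str) -> str:
    """Naive router: send short, simple prompts to a cheap model."""
    complex_markers = ("analyze", "compare", "multi-step", "reason")
    if len(prompt) < 200 and not any(m in prompt.lower() for m in complex_markers):
        return "cheap-model"
    return "premium-model"

print(route_model("Summarize this sentence."))
print(route_model("Analyze these three contracts and compare their liability clauses."))
```

Even a crude router like this can cut costs meaningfully if most traffic is simple, because the expensive model only sees the hard tail of requests.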

Frequently Asked Questions

What is the difference between training and inference?
Training is the process of building the model: feeding it data, adjusting billions of parameters, and running thousands of iterations over weeks or months on expensive GPU clusters. Training GPT-4 reportedly cost over $100M. Inference is the process of using the trained model to generate outputs from new inputs. Every time you send a prompt to ChatGPT or Claude, that is inference. Training happens once (or periodically). Inference happens every time a user makes a request.
Why does inference cost matter for product managers?
Inference cost is the primary variable cost for AI products. Every API call, every generated response, every image created is an inference operation with a real dollar cost. A product serving 100,000 users making 5 requests per day at $0.003 per request generates $1,500 in daily inference costs. These costs scale linearly with usage, unlike traditional software where marginal cost approaches zero. PMs must factor inference costs into pricing, usage limits, and tier design.
How can teams reduce inference costs?
Five main approaches. Model distillation: train a smaller, cheaper model to mimic the larger model's outputs for your specific use case. Caching: store and reuse outputs for identical or near-identical inputs. Batching: group multiple requests for more efficient GPU utilization. Model selection: route simple tasks to cheap models and only use expensive models for complex tasks. Quantization: reduce model precision (e.g., from FP16 to INT8) to run on cheaper hardware with minimal quality loss.

Explore More PM Terms

Browse our complete glossary of 100+ product management terms.