How to Control AI Agent Costs at Scale
Why AI agent costs matter now
AI agent costs are rising and unpredictable; this guide shows operators how to measure and control ai agent costs across fleets.
Heads of Engineering and FinOps teams running agent fleets face three problems at once: unpredictable model usage, hidden tail costs from long context windows, and feature-level decisions that blow monthly budgets. This guide gives an operator-first playbook: how to measure spend, how to build accountable cost models, which control levers to apply, and a 30-60-90 rollout plan to stop runaway spend while preserving velocity.
References
- ·OpenAI pricing for reference on per-token math: https://openai.com/pricing
- ·FinOps Foundation guide for cloud financial practices: https://www.finops.org/guides/
Measure first: the metrics you must track
Core telemetry
- ·tokens_sent and tokens_received per request. Record prompt tokens, completion tokens, and total costable tokens.
- ·requests_per_minute and requests_per-day, broken out by agent, capability, and user.
- ·model_mix: percentage of queries routed to each model family (gpt-4o, gpt-4, gpt-3.5, etc.).
- ·avg_context_window in tokens per request, and 95th percentile context window.
- ·cache_hit_rate: fraction of requests served from cache or short-circuiting policy.
- ·batch_size and batching_frequency when batch calls are used.
Derived metrics you will use
- ·cost_per_token = provider_price_per_token by model.
- ·cost_per_request = sum(cost_per_token * tokens) + external_call_costs.
- ·cost_per-capability = sum(cost_per_request for capability / number_of_runs).
- ·cost_per-user or per-seat = total_cost_allocated / active_users over window.
- ·tail_risk = proportion of spend from top 5% most expensive requests.
Concrete example: if gpt-4 costs $0.03 per 1K prompt tokens, and your average run uses 1,200 tokens, cost_per_request ≈ 1.2 * $0.03 = $0.036 plus completion cost. Track this daily for each model variant and capability.
Cost accounting models
Pick models that map to how teams consume value. Use one or more of these depending on org structure.
Per-run accounting
Charge back every agent run to a capability bucket. Useful for runtime optimization and for identifying expensive routines.
YAML budget snippet:
capability: "document-summarization"
billing:
model: "gpt-4"
budget_per_run_usd: 0.05
budget_per_month_usd: 500
Per-capability accounting
Aggregate runs by capability, then set monthly budgets. This is the most actionable for platform owners who own services like "meeting-notes" or "triage-bot".
Per-user or per-team accounting
Allocate a monthly quota per user or team. Use for self-service models where product partners need predictable bills.
Hybrid (recommended)
Record per-run, roll up to per-capability, and allocate proportionally to per-user where needed. This keeps granularity for optimization while enabling billing clarity.
Link cost reports to product metrics, for example: cost_per_conversation, cost_per-conversion, cost_per-resolution. Those ratios reveal whether a capability is delivering value relative to cost.
Internal reference for design patterns: agent cost accounting.
Control levers, ranked by impact and operational risk
Apply levers in this order: visibility, routing, caching, truncation, batching, caps, and model distillation.
1. Visibility and alerts
- ·Daily cost dashboards by model, capability, and user.
- ·Alerts for anomalies: >20% day-over-day spend spike, or new capabilities exceeding baseline.
2. Model routing
- ·Route low-sensitivity or high-volume traffic to cheaper models. Track error and quality metrics per route.
- ·Use A/B routing for controlled swaps.
3. Caching and memoization
- ·Cache deterministic outputs for identical inputs.
- ·Maintain cache_hit_rate target, for example 30-50% for templates and reference lookups.
4. Prompt truncation and context pruning
- ·Reduce context windows by summarizing histories, dropping low-value tokens, or cropping long documents before sending.
- ·Maintain a 95th percentile context window SLA.
5. Batching and async calls
- ·Batch similar requests to amortize per-call overhead and better pack tokens into one call.
- ·Consider async workers where latency is not user-visible.
6. Hard caps and soft quotas
- ·Soft quotas warn teams, then roll into hard caps that either throttle or degrade features gracefully.
- ·Example soft warning at 80% of monthly budget, hard cap at 100%.
7. Model distillation and smaller fine-tuned models
- ·Distill common patterns into cheaper models for high-volume, lower-complexity capabilities.
8. Operational playbooks
- ·Rate limit by user, degrade features under budget pressure, and fallback to cached or template responses.
Also embed "agent safety gates" in routing and caps so safety checks and cost controls run together for production workloads: agent safety gates.
A 10-step cost playbook (numbered)
- ·Instrument tokens, requests, model_mix, context_window, and cache_hit_rate at the request level.
- ·Build daily dashboards and a weekly report that shows cost_per-capability and tail_risk.
- ·Identify top 10% of runs by cost and inspect payloads and context.
- ·Apply caching to deterministic capabilities, target cache_hit_rate >= 30%.
- ·Route non-sensitive traffic to cheaper models and monitor quality metrics.
- ·Implement truncation policies for histories over X tokens, precompute summaries.
- ·Introduce batching for inbound events where latency allows.
- ·Set soft quotas and notification rules for owners, then hard caps if overruns continue.
- ·Run a 30-60-90 plan to scale controls without stopping innovation.
- ·Review monthly, update budgets, and publish cost-per-feature so product teams internalize trade-offs.
30-60-90 rollout plan
Days 0–30: Measure and stop the worst leaks
- ·Install request-level telemetry for tokens, models, and context.
- ·Ship dashboards and alerting for top-5 spend anomalies.
- ·Apply caching to the single highest-volume capability.
- ·Expected outcome: reduce anomalous spikes, find top 5% costly runs.
Days 31–60: Apply routing and quotas
- ·Implement model routing rules by capability and confidence bands.
- ·Add soft quotas to product teams, with 80% warning rules.
- ·Begin batching where practical.
- ·Expected outcome: 10–25% steady-state cost reduction, predictable monthly budgets.
Days 61–90: Optimize and institutionalize
- ·Roll out per-capability budgets and automatic hard caps for noncritical workloads.
- ·Deploy distilled models for the top two high-volume use cases.
- ·Publish cost-per-feature dashboards to stakeholders and run a FinOps review.
- ·Expected outcome: operational governance, ongoing 20–40% cost savings target depending on starting efficiency.
Toolkit: how AEGIS OS helps
AEGIS OS provides factual features to implement the playbook.
- ·Observability: request-level telemetry for tokens, model choice, context length, and outcomes, with built-in dashboards and daily exports.
- ·Policy controls: declarative routing and quota policies that can target model families, capabilities, or user groups.
- ·Cost dashboards: per-run, per-capability, and per-user rollups, with anomaly detection and scheduled reports.
- ·Enforced caps: soft and hard quota actions that throttle, degrade features, or reroute to fallback models.
- ·Integration points: provider-cost mapping for common vendors to compute cost_per_request automatically.
These tools map directly to the metrics and levers described earlier. Use them to enforce the YAML or JSON budget snippets in your platform.
Operational checks and governance
- ·Weekly FinOps review with product and platform owners. Use the FinOps Foundation guide for governance templates: https://www.finops.org/guides/.
- ·Change control for routing and caps. Require test deployments and a rollback window.
- ·Postmortems for any unexpected spend spikes, with root cause and action items assigned.
Example quick policies
Policy: per-capability hard cap
{
"capability": "email-digest",
"monthly_cap_usd": 300,
"soft_warning_pct": 80,
"action_at_cap": "throttle_to_cached",
"notify": ["[email protected]","[email protected]"]
}
Policy: routing rule
{
"match": {"capability": "qa-assistant","confidence_lt": 0.7},
"route_to": "gpt-3.5",
"metrics": {"max_context_tokens": 2048}
}
What to measure after changes
Track these KPIs weekly for 12 weeks:
- ·total_monthly_spend
- ·avg_cost_per_request
- ·pct_spend_from_top_5_percent_requests (tail_risk)
- ·cache_hit_rate
- ·model_mix shift percentage
- ·incidents where hard caps impacted production
Closing: trade-offs and final advice
Controlling agent costs is a continuous engineering problem. Prioritize measurement, then low-risk levers like caching and routing. Use budgets and quotas to force product trade-offs, not to stop experimentation. Distillation and model swaps are higher effort but yield recurring savings.
If you want a practical starting point, run the 30-day baseline: instrument telemetry, enable caching for one capability, and set a soft quota. That alone commonly reveals 20–40% of avoidable spend.
Book a demo to see how AEGIS OS instruments and enforces these controls, and how teams cut LLM infra by 20–40% in 30 days.