Example: SLA-constrained routing

AI Inference Routing

AI product traffic enters a routing policy, then splits across a small local model, a frontier GPU pool, an external model API, and a slower CPU batch fallback. Each path has a different GPU draw, latency profile, cost, and carbon footprint.
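To make the discussion concrete, here is a minimal sketch of the four paths as data. The field names and every number below are hypothetical placeholders for illustration, not figures from this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    """One serving path with its per-unit resource profile."""
    name: str
    gpu_hours_per_kreq: float   # GPU draw per 1k requests
    latency_ms: float           # typical per-request latency
    cost_per_kreq: float        # dollars per 1k requests
    carbon_per_kreq: float      # kg CO2e per 1k requests
    capacity_kreq: float        # monthly ceiling for this path

# All numbers are invented for illustration only.
PATHS = [
    Path("local_small",  0.2, 120,   0.5, 0.05, 30_000),
    Path("frontier_gpu", 1.5, 350,   4.0, 0.25, 25_000),
    Path("external_api", 0.0, 500,   6.0, 0.20, 10_000),  # vendor quota as capacity
    Path("cpu_batch",    0.0, 5_000, 0.8, 0.10, 50_000),
]
```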

Decision logic

Route inference traffic across model tiers, GPU pools, vendor fallback, and batch fallback under GPU, latency, cost, and carbon limits.
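A minimal sketch of that logic as a feasibility check, reusing the hypothetical Path records above: a candidate mix assigns a kreq/month volume to each path, and the checker reports which limits it breaks. The global limits are placeholder values, one per constraint family named in the text.

```python
# Hypothetical global limits, all placeholder values.
GPU_HOURS_BUDGET = 40_000   # pooled GPU-hours per month
LATENCY_SLA_MS = 400        # cap on traffic-weighted mean latency
COST_BUDGET = 200_000       # dollars per month
CARBON_CAP = 12_000         # kg CO2e per month

def violated_limits(mix: dict[str, float]) -> list[str]:
    """Return the limits a candidate mix (kreq/month per path) breaks."""
    by_name = {p.name: p for p in PATHS}
    total = sum(mix.values())
    problems = []
    for name, kreq in mix.items():
        if kreq > by_name[name].capacity_kreq:
            problems.append(f"{name} pool full")

    def spend(attr: str) -> float:
        return sum(getattr(by_name[n], attr) * k for n, k in mix.items())

    if spend("gpu_hours_per_kreq") > GPU_HOURS_BUDGET:
        problems.append("GPU hours exhausted")
    if total and spend("latency_ms") / total > LATENCY_SLA_MS:
        problems.append("latency SLA broken")
    if spend("cost_per_kreq") > COST_BUDGET:
        problems.append("cost budget exceeded")
    if spend("carbon_per_kreq") > CARBON_CAP:
        problems.append("carbon cap exceeded")
    return problems
```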

Decision question

Which inference serving mix can absorb growth without breaking GPU capacity, latency, budget, or carbon limits?
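One way to pose that question precisely is as a small linear program: maximize served kreq/month across the four paths subject to the limits. Below is a sketch using scipy.optimize.linprog with the hypothetical data above; a limit that binds shows up as near-zero slack.

```python
from scipy.optimize import linprog

def max_served_mix(paths=PATHS, gpu_budget=GPU_HOURS_BUDGET,
                   sla_ms=LATENCY_SLA_MS, cost_budget=COST_BUDGET,
                   carbon_cap=CARBON_CAP):
    """Maximize total served kreq/month subject to the four limit families."""
    c = [-1.0] * len(paths)  # linprog minimizes, so negate "total served"
    A_ub = [
        [p.gpu_hours_per_kreq for p in paths],   # total GPU hours
        [p.latency_ms - sla_ms for p in paths],  # weighted mean latency <= SLA
        [p.cost_per_kreq for p in paths],        # total cost
        [p.carbon_per_kreq for p in paths],      # total carbon
    ]
    b_ub = [gpu_budget, 0.0, cost_budget, carbon_cap]
    bounds = [(0, p.capacity_kreq) for p in paths]  # per-path pool / quota
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    mix = {p.name: round(x, 1) for p, x in zip(paths, res.x)}
    labels = ["GPU hours", "latency SLA", "cost budget", "carbon cap"]
    binding = [lbl for lbl, s in zip(labels, res.slack) if s < 1e-6]
    return mix, -res.fun, binding
```

Note that per-path capacities bind through the bounds rather than through res.slack, so a full pool would need a separate check. With placeholder numbers the returned binding list simply illustrates the shape of the "Active limits" readout below.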

Model output
  • Best first unlock: Expand local GPU pool
  • Current served traffic: ~62k kreq/month
  • Rejected shortcut: CPU fallback breaks latency policy
Active limits
  • Local GPU pools full
  • GPU hours fully used
  • Latency SLA near limit
  • External API quota tight
Key insight

The cheap batch fallback is available, but the latency budget rules out dumping growth onto it. The first useful unlock is more local GPU pool capacity.
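That claim is straightforward to probe in the sketch above: raise the local pool's capacity, re-solve, and check whether served traffic actually grows. dataclasses.replace is used because Path is frozen; the numbers remain placeholders.

```python
import dataclasses

def try_unlock(extra_local_kreq: float) -> None:
    """Re-solve with a larger local pool and report the change."""
    _, base_served, base_binding = max_served_mix()
    grown = [dataclasses.replace(p, capacity_kreq=p.capacity_kreq + extra_local_kreq)
             if p.name == "local_small" else p
             for p in PATHS]
    _, new_served, new_binding = max_served_mix(paths=grown)
    print(f"served: {base_served:,.0f} -> {new_served:,.0f} kreq/month")
    print(f"binding: {base_binding} -> {new_binding}")

try_unlock(10_000)
```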

  • Moving every overflow request to the batch fallback looks cheap, but the latency load blocks it, as the probe after this list shows.
  • An external API burst helps, but the quota tightens before it absorbs the whole spike.
  • More local GPU pool capacity lets the router preserve latency while growing served traffic.
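The first point is easy to reproduce with the feasibility checker: push the overflow onto the CPU batch path and see which limit trips. A hypothetical probe, using the placeholder mix below:

```python
# Shortcut probe: push all overflow onto the CPU batch path and check limits.
overflow_mix = {"local_small": 30_000, "cpu_batch": 40_000}
print(violated_limits(overflow_mix))
# With the placeholder numbers the traffic-weighted latency far exceeds the
# SLA, so the only flag raised is "latency SLA broken": cheap, but blocked.
```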