Example: SLA-constrained routing

AI Inference Routing

AI product traffic enters a routing policy, then splits across a small local model, a frontier GPU pool, an external model API, and a slower CPU batch fallback. Each path has a different GPU draw, latency profile, cost, and carbon footprint.
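To make the discussion concrete, here is a minimal sketch of the four paths as data. The field names and every number below are hypothetical placeholders for illustration, not figures from this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    """One serving path with its per-unit resource profile."""
    name: str
    gpu_hours_per_kreq: float   # GPU draw per 1k requests
    latency_ms: float           # typical per-request latency
    cost_per_kreq: float        # dollars per 1k requests
    carbon_per_kreq: float      # kg CO2e per 1k requests
    capacity_kreq: float        # monthly ceiling for this path

# All numbers are invented for illustration only.
PATHS = [
    Path("local_small",  0.2, 120,   0.5, 0.05, 30_000),
    Path("frontier_gpu", 1.5, 350,   4.0, 0.25, 25_000),
    Path("external_api", 0.0, 500,   6.0, 0.20, 10_000),  # vendor quota as capacity
    Path("cpu_batch",    0.0, 5_000, 0.8, 0.10, 50_000),
]
```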

Decision logic

Route inference traffic across model tiers, GPU pools, vendor fallback, and batch fallback under GPU, latency, cost, and carbon limits.
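A minimal sketch of that logic as a feasibility check, reusing the hypothetical Path records above: a candidate mix assigns a kreq/month volume to each path, and the checker reports which limits it breaks. The global limits are placeholder values, one per constraint family named in the text.

```python
# Hypothetical global limits, all placeholder values.
GPU_HOURS_BUDGET = 40_000   # pooled GPU-hours per month
LATENCY_SLA_MS = 400        # cap on traffic-weighted mean latency
COST_BUDGET = 200_000       # dollars per month
CARBON_CAP = 12_000         # kg CO2e per month

def violated_limits(mix: dict[str, float]) -> list[str]:
    """Return the limits a candidate mix (kreq/month per path) breaks."""
    by_name = {p.name: p for p in PATHS}
    total = sum(mix.values())
    problems = []
    for name, kreq in mix.items():
        if kreq > by_name[name].capacity_kreq:
            problems.append(f"{name} pool full")

    def spend(attr: str) -> float:
        return sum(getattr(by_name[n], attr) * k for n, k in mix.items())

    if spend("gpu_hours_per_kreq") > GPU_HOURS_BUDGET:
        problems.append("GPU hours exhausted")
    if total and spend("latency_ms") / total > LATENCY_SLA_MS:
        problems.append("latency SLA broken")
    if spend("cost_per_kreq") > COST_BUDGET:
        problems.append("cost budget exceeded")
    if spend("carbon_per_kreq") > CARBON_CAP:
        problems.append("carbon cap exceeded")
    return problems
```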

Decision question

Which inference serving mix can absorb growth without breaking GPU capacity, latency, budget, or carbon limits?
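One way to pose that question precisely is as a small linear program: maximize served kreq/month across the four paths subject to the limits. Below is a sketch using scipy.optimize.linprog with the hypothetical data above; a limit that binds shows up as near-zero slack.

```python
from scipy.optimize import linprog

def max_served_mix(paths=PATHS, gpu_budget=GPU_HOURS_BUDGET,
                   sla_ms=LATENCY_SLA_MS, cost_budget=COST_BUDGET,
                   carbon_cap=CARBON_CAP):
    """Maximize total served kreq/month subject to the four limit families."""
    c = [-1.0] * len(paths)  # linprog minimizes, so negate "total served"
    A_ub = [
        [p.gpu_hours_per_kreq for p in paths],   # total GPU hours
        [p.latency_ms - sla_ms for p in paths],  # weighted mean latency <= SLA
        [p.cost_per_kreq for p in paths],        # total cost
        [p.carbon_per_kreq for p in paths],      # total carbon
    ]
    b_ub = [gpu_budget, 0.0, cost_budget, carbon_cap]
    bounds = [(0, p.capacity_kreq) for p in paths]  # per-path pool / quota
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    mix = {p.name: round(x, 1) for p, x in zip(paths, res.x)}
    labels = ["GPU hours", "latency SLA", "cost budget", "carbon cap"]
    binding = [lbl for lbl, s in zip(labels, res.slack) if s < 1e-6]
    return mix, -res.fun, binding
```

Note that per-path capacities bind through the bounds rather than through res.slack, so a full pool would need a separate check. With placeholder numbers the returned binding list simply illustrates the shape of the "Active limits" readout below.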

Model output
  • Best first unlock: Expand local GPU pool
  • Current served traffic: ~62k kreq/month
  • Rejected shortcut: CPU fallback breaks latency policy
Active limits
  • Local GPU pools full
  • GPU hours fully used
  • Latency SLA near limit
  • External API quota tight
Key insight

The cheap batch fallback is available, but the latency budget rules out dumping growth onto it. The first useful unlock is more local GPU pool capacity.
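That claim is straightforward to probe in the sketch above: raise the local pool's capacity, re-solve, and check whether served traffic actually grows. dataclasses.replace is used because Path is frozen; the numbers remain placeholders.

```python
import dataclasses

def try_unlock(extra_local_kreq: float) -> None:
    """Re-solve with a larger local pool and report the change."""
    _, base_served, base_binding = max_served_mix()
    grown = [dataclasses.replace(p, capacity_kreq=p.capacity_kreq + extra_local_kreq)
             if p.name == "local_small" else p
             for p in PATHS]
    _, new_served, new_binding = max_served_mix(paths=grown)
    print(f"served: {base_served:,.0f} -> {new_served:,.0f} kreq/month")
    print(f"binding: {base_binding} -> {new_binding}")

try_unlock(10_000)
```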

  • Moving every overflow request to the batch fallback looks cheap, but the latency load blocks it, as the probe after this list shows.
  • An external API burst helps, but the quota tightens before it absorbs the whole spike.
  • More local GPU pool capacity lets the router preserve latency while growing served traffic.
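The first point is easy to reproduce with the feasibility checker: push the overflow onto the CPU batch path and see which limit trips. A hypothetical probe, using the placeholder mix below:

```python
# Shortcut probe: push all overflow onto the CPU batch path and check limits.
overflow_mix = {"local_small": 30_000, "cpu_batch": 40_000}
print(violated_limits(overflow_mix))
# With the placeholder numbers the traffic-weighted latency far exceeds the
# SLA, so the only flag raised is "latency SLA broken": cheap, but blocked.
```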