Model Serving

AI Inference Routing

Situation

Inference traffic is routed across four serving paths: a local small model, a local frontier pool, an external API, and a CPU batch fallback. The cheapest path, the CPU fallback, is too slow to meet the latency SLA, while the faster paths are running into GPU-hour and vendor-quota limits.

Decision

Which serving mix can absorb more traffic without breaking GPU-hour capacity, latency policy, vendor quota, budget, or carbon limits?

How we modeled it

Traffic enters at an inference-routing source and branches across the local small-model, local frontier, external API, and CPU fallback paths. Resource rates track GPU hours and latency load, and each path carries its own cost and carbon rates. The solver maximizes served requests and reports which policy or capacity limits prevent additional routing.
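As a rough illustration of the bookkeeping described above, the sketch below evaluates one candidate routing mix against a set of limits and labels each limit as ok, active, or violated. It does not perform the optimization itself. Every name and number in it (the Path fields, the rates, the limits, the 10% "active" tolerance, and the choice to treat the latency policy as a cap on traffic-weighted average latency) is an assumption made for the example, not a value taken from the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    gpu_hours: float   # GPU hours per kreq (hypothetical rate)
    latency_ms: float  # typical latency per request (hypothetical)
    cost: float        # cost units per kreq (hypothetical)
    carbon: float      # carbon units per kreq (hypothetical)
    max_kreq: float    # pool capacity or vendor quota (hypothetical)


# All numbers below are made up for illustration; none come from the model.
PATHS = {
    p.name: p
    for p in [
        Path("local_small", 0.5, 100.0, 1.0, 0.5, 30.0),
        Path("local_frontier", 2.0, 300.0, 4.0, 2.0, 20.0),
        Path("external_api", 0.0, 400.0, 8.0, 1.0, 12.0),
        Path("cpu_fallback", 0.0, 5000.0, 0.5, 0.25, 100.0),
    ]
}
LIMITS = {"gpu_hours": 55.0, "avg_latency_ms": 230.0, "budget": 300.0, "carbon": 100.0}


def classify(used, limit, tol):
    """Label a limit as 'violated', 'active' (within tol of the cap), or 'ok'."""
    ratio = used / limit
    if ratio > 1.0 + 1e-9:
        return "violated"
    return "active" if ratio >= 1.0 - tol else "ok"


def evaluate(alloc, paths=PATHS, limits=LIMITS, tol=0.10):
    """Return (served_kreq, {limit_name: status}) for a routing mix in kreq."""
    served = sum(alloc.values())

    def total(attr):
        # Sum a per-kreq rate over all paths, weighted by routed traffic.
        return sum(getattr(paths[name], attr) * kreq for name, kreq in alloc.items())

    used = {
        "gpu_hours": total("gpu_hours"),
        # Assumption: latency policy caps the traffic-weighted average latency.
        "avg_latency_ms": total("latency_ms") / served if served else 0.0,
        "budget": total("cost"),
        "carbon": total("carbon"),
    }
    status = {key: classify(used[key], limits[key], tol) for key in limits}
    for name, path in paths.items():
        status[f"cap:{name}"] = classify(alloc.get(name, 0.0), path.max_kreq, tol)
    return served, status


current = {"local_small": 30, "local_frontier": 20, "external_api": 11, "cpu_fallback": 0}
overflow = {**current, "cpu_fallback": 5}  # push a little growth to the CPU path

served, status = evaluate(current)
_, overflow_status = evaluate(overflow)
```

With these made-up inputs, the current mix leaves GPU hours, both local pools, the latency policy, and the API quota flagged as active, and routing even a small overflow to the CPU fallback flips the latency limit to violated. A real solve would search over allocations rather than check one by hand.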

What the model shows
First constraint to relieve: Expand local GPU pool
Current served traffic: ~62k kreq/month
Not feasible: CPU fallback exceeds latency policy
Active limits
  • Local GPU pools full
  • GPU hours fully used
  • Latency policy near limit
  • External API quota tight
What this shows

The lower-cost fallback path is available, but the latency budget prevents routing most growth there. The first constraint to relieve is local GPU pool capacity.

  • Moving overflow to batch fallback is lower-cost, but the latency budget prevents it.
  • External API burst helps, but quota tightens before it absorbs the whole increase.
  • More local GPU capacity lets the router preserve latency while growing served traffic.
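To see why a slow path absorbs so little growth, here is a hedged back-of-the-envelope sketch. It assumes the latency policy is a cap on traffic-weighted average latency, which the page does not state. Under that assumption, adding x kreq at latency l to a mix of N kreq averaging a stays within a limit L only while (aN + lx) / (N + x) <= L, which gives x <= N(L - a) / (l - L) whenever l > L. The function name and all numbers are hypothetical.

```python
import math


def max_extra_under_latency(total_kreq, avg_ms, path_ms, limit_ms):
    """Largest extra traffic (kreq) a path can take before the
    traffic-weighted average latency reaches limit_ms.

    Assumes the policy is an average-latency cap (an illustrative
    assumption, not something the source confirms).
    """
    if avg_ms > limit_ms:
        return 0.0  # the current mix already exceeds the limit
    if path_ms <= limit_ms:
        return math.inf  # this path never pushes the average past the limit
    return total_kreq * (limit_ms - avg_ms) / (path_ms - limit_ms)


# Hypothetical mix: 60 kreq averaging 200 ms against a 230 ms limit.
slow_path = max_extra_under_latency(60, 200.0, 5000.0, 230.0)  # batch-like path
mid_path = max_extra_under_latency(60, 200.0, 430.0, 230.0)    # API-like path
fast_path = max_extra_under_latency(60, 200.0, 100.0, 230.0)   # local-GPU-like path
```

In this toy setup the slow path has well under 1 kreq of latency headroom, the mid-latency path has 9 kreq before any quota is considered, and the fast path is not limited by latency at all, which matches the ordering of the three scenarios above.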