AI Inference Routing
Inference traffic routes across four serving paths: a local small model, a local frontier pool, an external API, and a CPU batch fallback. The cheapest path, the CPU batch fallback, is too slow to meet the latency SLA, while the faster paths are running into GPU-hour and vendor-quota limits.
Which serving mix can absorb more traffic without breaking GPU-hour capacity, latency policy, vendor quota, budget, or carbon limits?
Traffic enters an inference-routing source and branches across local small-model, local frontier, external API, and CPU fallback paths. Resource rates track GPU hours and latency load; cost and carbon rates sit on each path. The solve maximizes served requests and reports which policy or capacity limits prevent additional routing.
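The solve described above can be written as a small linear program. The formulation below is only a sketch under assumptions: every symbol is introduced here for illustration and none appears in the source, and the actual model may express latency policy or capacity differently. Let P be the four paths (small, frontier, api, cpu), x_p the requests routed to path p, D the offered traffic, u_p the capacity of path p, and g_p, l_p, c_p, e_p the per-request GPU-hour, latency-load, cost, and carbon rates, with limits G, L, B, E and vendor quota Q.

```latex
\begin{aligned}
\max_{x \ge 0} \quad & \sum_{p \in P} x_p \\
\text{s.t.} \quad
& \sum_{p \in P} x_p \le D && \text{(offered traffic)} \\
& x_p \le u_p \quad \forall p \in P && \text{(path capacity)} \\
& \sum_{p \in P} g_p x_p \le G && \text{(GPU hours)} \\
& \sum_{p \in P} \ell_p x_p \le L && \text{(latency load)} \\
& x_{\text{api}} \le Q && \text{(vendor quota)} \\
& \sum_{p \in P} c_p x_p \le B && \text{(budget)} \\
& \sum_{p \in P} e_p x_p \le E && \text{(carbon)}
\end{aligned}
```

Under this reading, a constraint that holds with equality at the optimum is binding, and the binding constraints are the limits that prevent additional routing.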
Limits at or near capacity in the solve:
- Local GPU pools full
- GPU hours fully used
- Latency policy near limit
- External API quota tight
The lower-cost CPU fallback path still has room, but the latency policy prevents routing most of the growth there. The first constraint to relieve is local GPU pool capacity.
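The limits listed above can be read as a slack report: a limit is binding when usage equals it, and near its limit when utilization passes some threshold. The following is a minimal sketch of such a report, not the solver itself. All path names, rates, limits, routed volumes, and the 0.9 threshold are hypothetical values chosen to reproduce the pattern described here; none come from the source.

```python
# All names and numbers below are hypothetical; none come from the source.
PATHS = ("local_small", "local_frontier", "external_api", "cpu_fallback")

# Requests currently routed to each path (assumed solve output).
routing = {"local_small": 400, "local_frontier": 200,
           "external_api": 290, "cpu_fallback": 50}

# Per-request consumption of each shared resource, by path (arbitrary units).
rates = {
    "gpu_hours":    {"local_small": 1, "local_frontier": 4, "external_api": 0, "cpu_fallback": 0},
    "latency_load": {"local_small": 1, "local_frontier": 2, "external_api": 2, "cpu_fallback": 10},
    "cost":         {"local_small": 1, "local_frontier": 5, "external_api": 8, "cpu_fallback": 1},
    "carbon":       {"local_small": 1, "local_frontier": 3, "external_api": 2, "cpu_fallback": 1},
}

# Upper limits on the shared resources.
limits = {"gpu_hours": 1200, "latency_load": 2000, "cost": 6000, "carbon": 3000}

# Per-path caps: local pool capacities and the vendor quota.
path_caps = {"local_small": 400, "local_frontier": 200, "external_api": 300}


def classify(used, limit, near=0.9):
    """Label a limit by how much of it is used; the 'near' threshold is an assumption."""
    if used > limit:
        return "violated"
    if used == limit:
        return "binding"
    return "near limit" if used >= near * limit else "slack"


def report(routing, rates, limits, path_caps):
    """Return a status label for every shared-resource limit and per-path cap."""
    status = {}
    for resource, per_request in rates.items():
        used = sum(per_request[p] * routing[p] for p in PATHS)
        status[resource] = classify(used, limits[resource])
    for path, cap in path_caps.items():
        status[f"cap:{path}"] = classify(routing[path], cap)
    return status


if __name__ == "__main__":
    for name, label in report(routing, rates, limits, path_caps).items():
        print(f"{name}: {label}")
```

With these assumed numbers, the GPU-hour limit and both local pool caps come out binding, latency load and the external API quota come out near their limits, and cost and carbon keep slack. Deciding which limit to relieve first would additionally need the solver's sensitivity information (for example, dual values), which this sketch does not compute.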