Skip to content

    When volume proves out.

    Reserved throughput.Same API key

    Dedicated GPUs for your models and fine-tunes, using the same account and API key.

    When your agent loop becomes production traffic: move to isolated GPUs with stable latency, predictable cost, and the same key.

    Find your break-even

    Serverless per-token vs a reserved GPU at a flat hourly rate. Pick a model and your monthly volume.

    Serverless · $10/mo Dedicated (1× NVIDIA H100 SXM) · $1,533/mo

    At 500M tokens/month, serverless is cheaper by $1,523/month.

    Crossover: dedicated wins above ~74.06B tokens/month for this model.

    When dedicated wins

    Steady-volume economics

    Past your break-even volume, a reserved GPU at a flat hourly rate undercuts per-token serverless pricing. Per-second GPU billing, with no extra idle surcharge.

    Consistent latency and isolation

    Isolated GPUs with predictable performance and no shared rate limits, so latency-sensitive workloads stay steady under load.

    Custom and fine-tuned models

    Deploy your own fine-tuned checkpoints on the same managed endpoints, beyond the open catalog.

    What you get

    The dedicated mechanics, exactly as priced on the Dedicated tab today.

    • Reserved NVIDIA GPUs with NVLink and InfiniBand, on-demand or reserved
    • Per-second GPU billing, with no extra idle surcharge
    • The same OpenAI-compatible endpoint and API key as serverless
    • Region selection and one-click setup
    • Managed fine-tuning: per-M training tokens plus storage
    • Observability and audit on every endpoint

    Models on dedicated

    Open models sized for reserved deployment, plus your own fine-tuned models.

    Bring your fine-tuned models

    Deploy your own fine-tunes and LoRA adapters on dedicated endpoints. Stack and swap adapters without retraining the base model.

    Fine-tuning

    More models

    How it works

    1

    Pick model and hardware

    Choose an open model or your fine-tune, and the GPU configuration the catalog sizes for it.

    2

    FlexAI provisions and manages

    We stand up the reserved endpoint, scale it, and keep it healthy. No infrastructure to run.

    3

    Same API, your endpoint

    Call it through the same OpenAI-compatible key. Move from serverless without migrating.

    Same account for serverless inference and training.

    Economics

    Per GPU-hour on FlexAI's NVIDIA and AMD fleets, with per-second GPU billing.

    • B200$6.25/hr
    • H200$3.15/hr
    • H100$2.10/hr
    See all dedicated GPU rates

    Open the full crossover calculator

    Customers

    Built for production workloads

    FlexAI provides a much more cost-effective & hassle-free experience for training & deploying my models.
    75%
    Lower compute cost

    Frequently Asked Questions

    Reserved throughput, when you're ready for it

    Reserved GPUs when volume proves out, serverless until then. One account throughout.

    $10/month in free credits for your first 3 months