How is a dedicated endpoint billed?

Per GPU-hour on reserved NVIDIA GPUs, with per-second GPU billing and no extra idle surcharge. On-demand and reserved options are available.

Can I run serverless and dedicated on one account?

Yes. Dedicated endpoints use the same account and the same OpenAI-compatible API key as Token Factory serverless. Move between them freely.

Can I deploy a fine-tuned model?

Yes. Your fine-tuned checkpoints deploy on the same managed dedicated endpoints. Managed fine-tuning is available, billed per million training tokens plus storage.

What if I outgrow or shrink?

Scale the reserved configuration up or down. Below your model's break-even volume, serverless per-token pricing is cheaper; above it, dedicated wins. The widget above shows your crossover.

How is this different from AI Factory?

Dedicated endpoints are our cloud reserved for you. AI Factory is our stack deployed on your hardware.

When volume proves out.

Reserved throughput.Same API key

Dedicated GPUs for your models and fine-tunes, using the same account and API key.

Get started Find your break-even

When your agent loop becomes production traffic: move to isolated GPUs with stable latency, predictable cost, and the same key.

>99.9%

Uptime

DragonLLM runs private finance inference on dedicated endpoints: autoscaling, scale-to-zero, hosted in France.

DragonLLM

75%

Lower compute cost

LegML, fine-tuned and served on FlexAI.

LegML

Find your break-even

Serverless per-token vs a reserved GPU at a flat hourly rate. Pick a model and your monthly volume.

ModelMonthly volume: 500M tokens

Serverless · $12/mo Dedicated (1× NVIDIA H100 SXM) · $1,533/mo

At 500M tokens/month, serverless is cheaper by $1,522/month.

Crossover: dedicated wins above ~66.65B tokens/month for this model.

Talk to us Open the full calculator

When dedicated wins

Steady-volume economics

Past your break-even volume, a reserved GPU at a flat hourly rate undercuts per-token serverless pricing. Per-second GPU billing, with no extra idle surcharge.

Consistent latency and isolation

Isolated GPUs with predictable performance and no shared rate limits, so latency-sensitive workloads stay steady under load.

Custom and fine-tuned models

Deploy your own fine-tuned checkpoints on the same managed endpoints, beyond the open catalog.

What you get

The dedicated mechanics, exactly as priced on the Dedicated tab today.

Reserved NVIDIA GPUs with NVLink and InfiniBand, on-demand or reserved
Per-second GPU billing, with no extra idle surcharge
The same OpenAI-compatible endpoint and API key as serverless
Region selection and one-click setup
Managed fine-tuning: per-M training tokens plus storage
Observability and audit on every endpoint

Models on dedicated

Open models sized for reserved deployment, plus your own fine-tuned models.

DeepSeek V3.216× H100 DeepSeek R116× H100 GPT-OSS 20BH100 Llama 4 Maverick16× H100 Qwen3 Coder 480B A35B8× H100 Qwen3 235B A22B 25074× H100

Bring your fine-tuned models

Deploy your own fine-tunes and LoRA adapters on dedicated endpoints. Stack and swap adapters without retraining the base model.

Fine-tuning

More models

How it works

1

Pick model and hardware

Choose an open model or your fine-tune, and the GPU configuration the catalog sizes for it.

2

FlexAI provisions and manages

We stand up the reserved endpoint, scale it, and keep it healthy. No infrastructure to run.

3

Same API, your endpoint

Call it through the same OpenAI-compatible key. Move from serverless without migrating.

Same account for serverless inference and training.

Economics

Per GPU-hour on FlexAI's NVIDIA and AMD fleets, with per-second GPU billing.

B200$6.25/hr
H200$3.15/hr
H100$2.10/hr

See all dedicated GPU rates

Open the full crossover calculator

One account, the whole way

Dedicated endpoints are rung three. Same account from your first request to your private cloud.

Customers

Built for production workloads

“FlexAI provides a much more cost-effective & hassle-free experience for training & deploying my models.”