What latency can I expect with FlexAI inference?

FlexAI delivers sub-100ms time to first token (TTFT) for most models. Full responses take longer: H100-based endpoints achieve ~934ms end-to-end for LLaMA 3.1 8B, while H200 delivers ~679ms. Streaming further reduces perceived latency, because tokens are shown as they arrive.
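Time to first token and end-to-end latency are different measurements, which is why a sub-100ms figure and a ~934ms figure can describe the same endpoint. A minimal sketch of how a client might time both on any token stream follows. `fake_stream` is a hypothetical stand-in for a streaming response and is not a FlexAI API.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_stream(n_tokens: int = 5, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming endpoint: yields tokens with a delay."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


def measure_latency(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_ms, end_to_end_ms, token_count) for a token stream."""
    start = time.perf_counter()
    ttft_ms = None
    count = 0
    for _ in stream:
        if ttft_ms is None:
            # The first token marks the moment the user starts seeing output.
            ttft_ms = (time.perf_counter() - start) * 1000.0
        count += 1
    total_ms = (time.perf_counter() - start) * 1000.0
    if ttft_ms is None:
        ttft_ms = total_ms  # empty stream: no token ever arrived
    return ttft_ms, total_ms, count


if __name__ == "__main__":
    ttft, total, n = measure_latency(fake_stream())
    print(f"TTFT {ttft:.1f} ms, end-to-end {total:.1f} ms over {n} tokens")
```

With streaming, the user waits for the first number; without it, they wait for the second.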

How does auto-scaling work?

FlexAI automatically scales GPU endpoints based on request volume. Scale from zero to hundreds of GPUs with no manual intervention — you only pay for compute used during active requests.
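The scaling decision can be pictured as a small calculation. The sketch below is an illustration only, not FlexAI's actual policy: it assumes a concurrency-based rule, and `desired_replicas` and `per_replica_concurrency` are hypothetical names. The 0 to 512 bounds mirror the `scale=(0, 512)` setting shown elsewhere on this page.

```python
import math


def desired_replicas(in_flight: int, per_replica_concurrency: int,
                     min_replicas: int = 0, max_replicas: int = 512) -> int:
    """Replicas needed to serve in-flight requests, clamped to [min, max]."""
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    needed = math.ceil(in_flight / per_replica_concurrency)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    for load in (0, 1, 17, 10_000):
        print(load, "requests ->", desired_replicas(load, 8), "replicas")
```

With no traffic the rule returns zero replicas, which is the scale-to-zero case where nothing is billed.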

Which models are supported for inference?

FlexAI supports any model that runs on vLLM, TGI, or custom containers — including LLaMA, Mistral, Qwen, Stable Diffusion, and custom fine-tuned models.

Inference that lets you ship a feature


Serverless: Sub-100ms TTFT for interactive workloads

Batch: Async execution with queued jobs and callbacks

Dedicated: Isolated GPUs with predictable performance
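For the batch mode, "queued jobs and callbacks" can be sketched with the standard library alone. This is a generic asyncio pattern, not the FlexAI batch API. `run_inference` is a hypothetical stand-in for a model call, and `run_batch` is an illustrative name.

```python
import asyncio
from typing import Callable, List


async def run_inference(prompt: str) -> str:
    """Hypothetical stand-in for a model call."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def worker(queue: asyncio.Queue) -> None:
    """Pull (prompt, callback) jobs off the queue until cancelled."""
    while True:
        prompt, callback = await queue.get()
        try:
            callback(prompt, await run_inference(prompt))
        finally:
            queue.task_done()


async def run_batch(prompts: List[str], callback: Callable[[str, str], None],
                    n_workers: int = 2) -> None:
    """Queue every prompt, process them with n_workers, call back per result."""
    queue: asyncio.Queue = asyncio.Queue()
    for p in prompts:
        queue.put_nowait((p, callback))
    workers = [asyncio.create_task(worker(queue)) for _ in range(n_workers)]
    await queue.join()  # returns once every queued job has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)


if __name__ == "__main__":
    results = {}
    asyncio.run(run_batch(["a", "b", "c"], results.__setitem__))
    print(results)
```

The caller hands over work and a callback, then gets notified per job instead of holding a connection open for each request.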

Your endpoint should be a function, not a lifestyle

# inference in one file
from flexai import endpoint

@endpoint(
  latency_target_ms=180,
  budget_monthly_usd=5000,
  scale=(0, 512),
  gpu="best_for_workload",
  region="closest"
)
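def chat(req):
  # load() and stream() are not imported in this snippet; where they come from is not shown here
  model = load("your-model")
  return stream(model.generate(req.prompt))

# deploy (run from a shell, not from Python)
flexai deploy chat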

Looks simple because it should be.

Production visibility without guesswork

Understand what happened without digging through tools and logs.

Live usage metrics

Latency, tokens, GPU, region, cache. In one view.

Cost lens

What is expensive, what is waste, what to fix next.

Auto-scaling

Scale fractional or full GPUs based on traffic.

FlexAI Console — workload management dashboard

Production-ready blueprints

Generate Video with Wan2.2 (inference)

Generate Audio with Stable Audio Open (inference)

Generate Images with Stable Diffusion 3.5 (inference)

Deploy Speech-to-Text Transcription (inference)

Build a RAG App with LangChain (app)

Deploy Multi-Agent LangGraph Systems (app)

Ship inference like it is a feature, not a project


Built for teams shipping real inference to real users. Sizing a workload? Compare costs in the GPU savings calculator or see bare metal for dedicated capacity.

flexai deploy \
  --workload inference \
  --constraint latency=p95:180ms \
  --constraint 'budget=$5000' \
  --constraint scale=0..512 \
  --constraint gpu=best
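A constraint such as `latency=p95:180ms` means that 95% of requests should finish within 180 ms. A minimal sketch of how that check could be computed from recorded latencies is below. It assumes the nearest-rank percentile method, and `percentile_nearest_rank` and `meets_target` are hypothetical helper names. How FlexAI itself evaluates the constraint is not documented here.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]


def meets_target(latencies_ms: Sequence[float], target_ms: float = 180.0,
                 pct: float = 95.0) -> bool:
    """True if the pct-th percentile latency is within the target."""
    return percentile_nearest_rank(latencies_ms, pct) <= target_ms


if __name__ == "__main__":
    one_slow = [100.0] * 19 + [500.0]   # 1 of 20 requests is slow
    two_slow = [100.0] * 18 + [500.0] * 2
    print(meets_target(one_slow), meets_target(two_slow))
```

A p95 target tolerates a small tail of slow requests, so one slow request in twenty passes while two in twenty do not.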

FAQ

The questions you will ask anyway

Do I have to pick a cloud?

No. You pick outcomes. FlexAI figures out placement.

Is this only for LLMs?

No. It fits any inference workload that needs predictable latency, throughput, or cost control.

What if my traffic is spiky?

That is the whole point. Scale up fast, then back down.

Can I start from a blueprint?

Yes. Treat it like a shortcut, not a tutorial.

Docs

Everything you need to deploy without reading a manifesto.
