How to Evaluate an Inference Provider: 5 Checks to Run

TL;DR: Run five checks on the provider's actual endpoint before you commit: does it emit valid tool calls, what model is it really serving you, is that model reliably up, how does it perform on your prompt, and can you leave when you outgrow it. Why: a leaderboard scores a model once, on a single fixed serve at a reference precision. Your provider runs its own serving stack, its own quantization, its own uptime. We tested this. The same Llama-3.3-70B returned valid structured tool calls only ~37% and ~41% of the time on two providers' full-precision serves, and 100% on a quantized serve. Across 2,100 correctness-graded calls, quantization down to fp4 showed zero degradation. The model was identical across all three routes, so the serve is the variable.

Before you route a production workload to an inference provider, test the provider's actual endpoint, not its benchmark. The benchmark measured the model once, on one reference serve someone else controls. Your pipeline hits a different serve, with a different serving stack, and that route is the variable nobody tests. The five checks below run on the endpoint itself, where reliability, cost, and lock-in actually live.

We built this checklist because we had to. We vet serves before we put them in front of anyone, serve after serve, and every time we trusted a public number over our own probe, the endpoint proved us wrong. The checklist came out of that repetition, not one benchmark surprise. It's model-agnostic and provider-agnostic. Run it against any provider you're evaluating, including us.

One thing up front, because it's the whole point. A public benchmark like BFCL does test tool-call emission, and it's a good benchmark. But it scores each model once, on one serve Berkeley controls, at a reference precision. That number doesn't transfer to your provider's endpoint, and it says nothing about whether the same call works the tenth time you make it. You don't ship the leaderboard, you ship the endpoint. So test it.

Does the provider's serve actually emit valid tool calls?

This is the check that matters most for agent workloads, and it's the one the leaderboard hides.

Here's what we found. We took the same model, Llama-3.3-70B, and ran the same tool schema against three different serving routes. Two were "full-precision" serves from two different providers. One was a quantized serve. The two full-precision routes returned valid structured tool calls only about 37% and 41% of the time. The rest came back malformed, with tool markup dumped into the content field or arguments that didn't match the schema. The quantized route returned valid calls 100% of the time.

Read that again, because the instinct is to blame quantization, and that's exactly backwards. The clean serve was the quantized one. We ran 2,100 correctness-graded tool calls comparing full precision against quantization down to fp4, and quantization showed zero degradation. Zero. Same correct tool, same values, same trajectory. The serving route broke tool calling. Precision had nothing to do with it.

That's why the check has to run on the actual endpoint. A model that scores well on a leaderboard can still hand you 37% valid calls on a specific provider's serve, and no public number will warn you.

How to run it: take your real tool schema, not a toy get_weather. Fire N repeats of a representative call at each provider you're evaluating. Count the fraction that return a well-formed tool_calls object with arguments that validate against your schema. That fraction is your number. If it's not near 100% on your schema, the serve isn't ready for an agent loop, no matter what the model scored somewhere else.
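
The count described above can be sketched as follows. This is a minimal illustration, not a full JSON Schema validator: it assumes OpenAI-style chat completion responses already parsed into dicts, and the `SCHEMA` shape, the `create_ticket` tool, and the helper names are hypothetical stand-ins for your own.

```python
import json

# Hypothetical tool description for illustration; substitute your real schema.
SCHEMA = {
    "name": "create_ticket",
    "required": {"title": str, "priority": int},
}

def is_valid_tool_call(response: dict, schema: dict) -> bool:
    """True if the response carries a structured call whose arguments fit the schema."""
    try:
        message = response["choices"][0]["message"]
        calls = message.get("tool_calls") or []
        if not calls:
            return False  # prose or raw markup in `content` counts as a failure
        function = calls[0]["function"]
        if function["name"] != schema["name"]:
            return False
        args = json.loads(function["arguments"])  # arguments arrive as a JSON string
    except (KeyError, IndexError, TypeError, AttributeError, ValueError):
        return False
    if not isinstance(args, dict):
        return False
    # Simplified check: every required key is present with the expected type.
    return all(
        key in args and isinstance(args[key], expected)
        for key, expected in schema["required"].items()
    )

def valid_fraction(responses: list, schema: dict) -> float:
    """Fraction of responses with a well-formed, schema-matching tool call."""
    if not responses:
        return 0.0
    return sum(is_valid_tool_call(r, schema) for r in responses) / len(responses)
```

Collect the N responses per provider however your client returns them, then compare `valid_fraction` across providers. For a schema with nested objects or enums, a real JSON Schema validator would be the safer choice than the type check used here.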

One nuance worth knowing before you write a provider off: some serves won't volunteer a call at tool_choice: auto but emit cleanly at tool_choice: required. Try required before you conclude the serve is broken. If it still dumps raw tool markup into content even at required, that's a missing tool-call parser on the serve, and that's a real problem.
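
One way to encode that retry is a small probe that tries `auto` first and falls back to `required`. This is a sketch under assumptions: `send` is a hypothetical callable that posts a chat completion payload to the provider and returns the parsed response dict, and the verdict labels are ours.

```python
from typing import Callable

def has_tool_call(response: dict) -> bool:
    """True if the first choice carries a non-empty `tool_calls` list."""
    message = response["choices"][0]["message"]
    return bool(message.get("tool_calls"))

def probe_tool_choice(send: Callable[[dict], dict], payload: dict) -> str:
    """Try tool_choice 'auto', then 'required', and report where the serve emits a call."""
    for mode, verdict in (("auto", "auto_ok"), ("required", "required_only")):
        response = send({**payload, "tool_choice": mode})
        if has_tool_call(response):
            return verdict
    # No structured call even when one is required: inspect `content` for raw markup.
    return "fails_at_required"
```

A `required_only` result is workable if your agent loop can set `tool_choice` explicitly. A `fails_at_required` result is the one to take seriously.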

What model is the provider actually serving you?

You asked for a model. Read the served-model identifier on the response and confirm you got it.

Providers commonly serve an FP8 build of the model you named, and that's usually fine (in our own tests, quantization down to fp4 showed zero degradation on tool correctness). So this check isn't "catch them quantizing you." It's "know what actually ran."

The reason it matters is reproducibility, not accuracy. When a serve changes underneath you, or a provider swaps a build, you want to see it in the logs, not discover it three weeks later when behavior drifts and you have no idea what moved. The transparency is the point. A serve that tells you what it's running lets you pin it. A serve that doesn't leaves you debugging in the dark.

How to run it: log the served model identifier and precision on every call, or at least sample it. If the provider won't tell you what build you're hitting, that opacity is the finding.
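
A minimal sketch of that logging, assuming OpenAI-style responses parsed into dicts. The `model` and `system_fingerprint` fields exist in that response format, but a given provider may omit either one. Precision is not a standard response field, so how you read it is provider-specific and left out here. The function names are hypothetical.

```python
from collections import Counter

def served_identity(response: dict) -> tuple:
    """Return (model id, system fingerprint) as reported on the response, if present."""
    return (response.get("model"), response.get("system_fingerprint"))

def identity_report(requested: str, responses: list) -> dict:
    """Tally which builds answered and how many calls reported a different model id."""
    seen = Counter(served_identity(r) for r in responses)
    mismatches = sum(
        count for (model, _fingerprint), count in seen.items() if model != requested
    )
    return {
        "requested": requested,
        "distinct_builds": len(seen),
        "mismatches": mismatches,
        "seen": dict(seen),
    }
```

The comparison is an exact string match, so a provider that returns a differently formatted name for the same build will show up as a mismatch. Treat a mismatch as a prompt to go and look, not as proof of a swap.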

Is the model reliably served, or just listed?

A model in the catalog is a marketing claim. A model that answers 30 calls in a row is a fact. The gap between them is where production breaks.

From our own testing: a ~235B model sat right there in the catalog and returned a 503 on every single one of 30 calls. It was disabled on the serve and still listed. Another model errored on about 80% of calls, 24 out of 30. A third intermittently threw 500s. None of that shows up on a features page. All of it shows up in your error budget.

This is the failure mode that's easiest to miss in evaluation, because your first test call might get lucky. One green call only tells you it answered once. Reliability takes a batch of calls.

How to run it: fire N calls, more than a handful, and read the error rate. Watch for flapping: a serve that works, then 500s, then works again. Intermittent is worse than dead, because dead you catch immediately and intermittent you ship.
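
Reading the error rate and spotting flapping can be sketched from nothing more than the ordered list of HTTP status codes your batch produced. The definition of a flap used here, any switch between success and failure on consecutive calls, is our assumption.

```python
def availability(statuses: list) -> dict:
    """Summarize a batch of HTTP status codes, in the order the calls were made."""
    ok = [200 <= status < 300 for status in statuses]
    errors = ok.count(False)
    # A flap is any transition between success and failure on consecutive calls.
    flaps = sum(1 for prev, cur in zip(ok, ok[1:]) if prev != cur)
    return {
        "calls": len(ok),
        "error_rate": errors / len(ok) if ok else 0.0,
        "flaps": flaps,
    }
```

A dead serve shows an error rate of 1.0 with zero flaps. An intermittent one shows a lower error rate with a high flap count, which is the pattern a single green call hides.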

How does it perform on your prompt, not the headline?

The latency number on a provider's page is close to useless for your workload. It was measured on someone else's prompt, at someone else's concurrency, on a serve that may not even be the one you get. Don't trust it. Measure it, on your prompt, at your load. Here's the full protocol, because "it feels fast" is not a measurement.

Measure four things, and measure them at percentiles, not averages:

Time to first token (TTFT). How long until the stream starts. It's what a user feels as "is it working." Record p50 and p95, not the mean.
Inter-token latency / tokens per second. How fast the stream flows once it starts. A good TTFT with a slow token rate still feels sluggish on a long answer.
End-to-end latency. TTFT plus generation, the total wall-clock for a real response. p50 is the typical case; p95 and p99 are what a bad minute looks like.
Throughput under concurrency. Requests per second and tokens per second when you fire N calls at once, not one at a time. A serve that's quick single-threaded can collapse at 20 concurrent.

The percentiles are the whole point. A serve with a great p50 and an ugly p95 feels fine in a demo and falls over in production, because your users hit the tail, not the median. Watch the spread between p50 and p95. A wide gap means the serve gets unpredictable under load, and unpredictable is what pages you at 2am.
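
For reference, here is one way to turn a list of latency samples into those percentiles. The sketch uses the nearest-rank method, which is a choice on our part (other definitions interpolate), and the ratio it reports is just one convenient way to express the p50-to-p95 spread.

```python
import math

def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def tail_summary(samples: list) -> dict:
    """p50 / p95 / p99 plus the p95-to-p50 ratio as a rough spread indicator."""
    p50, p95, p99 = (percentile(samples, p) for p in (50, 95, 99))
    return {"p50": p50, "p95": p95, "p99": p99, "p95_over_p50": p95 / p50}
```

Run it separately on your TTFT samples and your end-to-end samples. With only a handful of samples, p95 and p99 collapse onto the slowest call, so the tail numbers only mean something on a reasonably large batch.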

Two more things the headline never tells you:

Cold start. The first call after the serve has been idle is often far slower than a warm one, so measure a cold call, not just a warmed-up loop.
Rate limits. Deliberately exceed the provider's limit and watch what happens. A clean 429 you can back off from is fine. A silent queue that adds seconds, or a hard error that drops the request, is a production hazard you want to find in testing, not in an incident.
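
The three rate-limit outcomes above can be told apart from the status code, the elapsed time, and the `Retry-After` header. This is a sketch: the labels and the `stall_factor` threshold (how many times slower than your unthrottled baseline counts as a stall) are assumptions to tune for your workload.

```python
from typing import Optional

def classify_rate_limit(
    status: int,
    elapsed_s: float,
    baseline_s: float,
    retry_after: Optional[str] = None,
    stall_factor: float = 3.0,
) -> str:
    """Label how the serve behaved on a call sent past the rate limit."""
    if status == 429:
        # A 429 is something you can back off from; a Retry-After header says for how long.
        return "graceful_429_with_retry_after" if retry_after else "graceful_429"
    if 200 <= status < 300:
        # The call succeeded, but a large slowdown suggests it sat in a hidden queue.
        return "silent_queue" if elapsed_s > stall_factor * baseline_s else "not_limited"
    return "hard_error"
```

If every over-limit call comes back `not_limited`, you probably haven't exceeded the limit yet, so push harder before drawing a conclusion.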

We deliberately don't publish our own latency numbers here, because a first-party latency figure is exactly the kind of headline this check tells you to ignore. The method is the portable part, and it's the same measurement we run on each serve internally before we ship it.

How to run it: replay a representative prompt at production length. Sweep a concurrency ladder: one call, then your real expected concurrency, then double it. At each level record p50 / p95 / p99 for TTFT and end-to-end latency, plus tokens per second. Fire one cold call after an idle gap. Then push past the rate limit on purpose and note the behavior. Pass condition: the p95 holds at your real concurrency, the p50-to-p95 spread stays tight, and the rate-limit path is a graceful 429 rather than a stall or a dropped request.
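
The concurrency ladder can be sketched with `asyncio`. This version records end-to-end wall-clock time only; TTFT needs a streaming client that timestamps the first chunk, which is out of scope here. `fake_call` is a placeholder that just sleeps, so swap in an async function that sends your real prompt.

```python
import asyncio
import time

async def timed(call) -> float:
    """Await one call and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    await call()
    return time.perf_counter() - start

async def run_level(call, concurrency: int) -> list:
    """Fire `concurrency` calls at once and collect their durations."""
    return list(await asyncio.gather(*(timed(call) for _ in range(concurrency))))

async def sweep(call, expected_concurrency: int) -> dict:
    """Ladder from the text: one call, expected concurrency, then double it."""
    ladder = [1, expected_concurrency, 2 * expected_concurrency]
    return {level: await run_level(call, level) for level in ladder}

async def fake_call() -> None:
    """Placeholder for a real request to the provider (assumption: replace this)."""
    await asyncio.sleep(0.01)

if __name__ == "__main__":
    results = asyncio.run(sweep(fake_call, expected_concurrency=4))
    for level, durations in results.items():
        print(level, f"slowest={max(durations):.3f}s")
```

Each level here is a single burst. For stable percentiles, repeat each level several times and pool the durations before computing p50, p95, and p99.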

Can you leave, and can you grow without leaving?

The first four checks tell you whether a provider works today. This one tells you whether you'll still want it in a year, and what it costs to change your mind. Lock-in shows up three ways: the exit (how hard it is to switch away), the switching costs that build up as you adopt more of a provider (most of them invisible until you try to leave), and the ceiling (whether the provider grows with you or forces a migration when your needs change).

One scope note: this check is not about cost per token. A model that's cheap per token but wrong or slow costs you retries you never see on the price sheet, but that's a model-and-task question for the model checklist. The provider question is whether you can leave, and grow without having to.

Test the exit directly. Point your existing OpenAI-compatible client at the provider with nothing but a base URL and a key. If your code runs unchanged, leaving later is cheap, because switching back out is the same one-line change. If you had to rewrite your integration around a proprietary API, you're half-locked-in before you've even shipped. Same question for your artifacts: if you fine-tune, do you get the weights, or are they trapped on the platform?
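
To make the exit test concrete, here is a sketch that builds an OpenAI-compatible chat request where the base URL and key are the only provider-specific inputs. It uses only the standard library. The environment variable names are hypothetical, and it assumes the provider exposes the usual `/chat/completions` path under its base URL.

```python
import json
import os
import urllib.request

def build_chat_request(payload: dict, base_url: str = None, api_key: str = None):
    """Build a POST to {base_url}/chat/completions; only URL and key vary by provider."""
    # INFERENCE_BASE_URL and INFERENCE_API_KEY are illustrative names, not a standard.
    base_url = (base_url or os.environ["INFERENCE_BASE_URL"]).rstrip("/")
    api_key = api_key or os.environ["INFERENCE_API_KEY"]
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Send it with `urllib.request.urlopen(request)`. If switching providers means changing anything besides those two values and the model name in the payload, that extra change is the switching cost this check is looking for.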

Then look at the switching costs that build up as you adopt more of a provider. Most stay invisible until the day you try to leave. Four that teams underestimate:

Data gravity and egress. The more of your data, logs, and fine-tuning sets you park in one cloud, the more it costs to pull them out, sometimes literally, as egress fees to move your own data. On the big clouds especially, the more you adopt, the more you're taxed to leave. That isn't an accident, it's an exit fee, and it grows every month you stay.
API and framework parity. A one-line base-URL swap only saves you if the destination supports what you built on. Teams underestimate this: even moving between two managed closed-source APIs can be painful when the target is missing features you depend on. And if you wired your prompts and function-calling schemas to one model family's quirks, or built on a provider's proprietary agent SDK, the lock is in your integration, not your invoice.
A model you depend on can disappear. Managed providers deprecate and pull models, sometimes an open one you were running in production, on a short migration window. If your provider is your only route to a given model, its roadmap is your roadmap.
A deployment layer above the API. The API itself is portable; a proprietary deployment DSL or config layer stacked on top of it is not. Adopt that convenience and leaving turns back into a rewrite.

The through-line: portability you don't exercise erodes. The base URL is portable. The operational layers you build on top of a provider are the part that's hard to move, so pick a provider that keeps them yours.

Then test the ceiling, the one people forget. Most providers do one slice of the stack. Serverless pay-per-token is great until you need consistent latency and reach for dedicated capacity, or your data gets sensitive and you want it on your own hardware, or the off-the-shelf model isn't good enough and you need to fine-tune. If your provider only does the one slice, each of those is a migration: new platform, new integration, new eval. A provider you can grow into, from serverless to dedicated endpoints to fine-tuning to self-hosting on the same account and the same API, is one you don't have to leave to grow. Run this check on everyone, us included, because a serverless endpoint that's cheap in your bake-off can be the one you tear out in six months.

How to run it: do the base-URL swap for real, on the provider you're evaluating and on your current one, and count the lines you had to change. Read the fine print on getting your data and fine-tuned weights back out, and what it costs to move them. Check whether the provider pushes a proprietary layer above the standard API. Then map your next twelve months, dedicated capacity, fine-tuning, on-prem, whatever's plausible, and ask which of those the provider supports without a migration. The steps it doesn't support, and the data you can't cheaply extract, are your future rip-and-replace.
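
Two of those steps are mechanical enough to script. The sketch below counts changed lines between the integration code before and after the swap using `difflib`, and lists the roadmap steps a provider doesn't cover. The function names and the roadmap labels are illustrative assumptions.

```python
import difflib
from itertools import islice

def lines_changed(before: str, after: str) -> dict:
    """Count removed and added lines between two versions of an integration file."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    body = list(islice(diff, 2, None))  # skip the two file-header lines
    return {
        "removed": sum(line.startswith("-") for line in body),
        "added": sum(line.startswith("+") for line in body),
    }

def roadmap_gaps(roadmap: list, supported: set) -> list:
    """Roadmap steps the provider does not support, in roadmap order."""
    return [step for step in roadmap if step not in supported]
```

An edited line counts once as removed and once as added, so a true one-line swap reads as one removal and one addition. Anything `roadmap_gaps` returns is a step that would force a migration.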

The whole recipe on one page

Run all five against any provider's endpoint before you commit. None of it needs the provider's cooperation beyond an API key.

Check	What it catches	How to run it	Pass
1. Valid tool calls	serve returns prose or raw markup instead of structured calls	your real tool schema, N repeats per provider; count valid `tool_calls`	~100% valid on your schema
2. Served-model identity	silent build / precision swap you can't see	log the served-model id + precision on every call	the provider tells you the exact build
3. Availability	"listed" but 503s or flaps under real traffic	N calls, read the error rate, watch for flapping over a window	low error rate, no flapping
4. Performance	good average, ugly tail, cold starts, rate-limit stalls	p50/p95/p99 for TTFT + end-to-end and tok/s across a concurrency ladder; one cold call; exceed the rate limit	p95 holds at your concurrency, tight p50-to-p95 spread, graceful 429
5. Lock-in & growth	proprietary API, egress / data-gravity exit fees, or a one-slice platform you'll outgrow	base-URL swap (count the lines changed); check data + weight export and egress terms; map your next 12 months (dedicated, fine-tune, on-prem) vs what the provider supports	one-line switch-out, your data and weights portable, grows with you without a migration

If a provider fails Check 1 or Check 3, stop. A serve that can't emit clean tool calls or can't stay up isn't a production option, whatever its benchmark says.
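
That stop rule is easy to encode once each check has produced a pass or fail. In this sketch the check keys are hypothetical, and the "investigate" outcome for a failure on Checks 2, 4, or 5 is our assumption; the text above only fixes what happens for Checks 1 and 3.

```python
BLOCKING_CHECKS = ("tool_calls", "availability")  # Check 1 and Check 3

def verdict(results: dict) -> str:
    """Return 'stop', 'investigate', or 'proceed' from per-check pass/fail booleans."""
    if any(not results[name] for name in BLOCKING_CHECKS):
        return "stop"
    return "proceed" if all(results.values()) else "investigate"
```

For example, a provider that passes everything except lock-in comes back as "investigate", while one that fails the tool-call check comes back as "stop" regardless of the rest.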

What this doesn't prove

Be honest about the edges. This isn't a global provider ranking, and one serve result on one model doesn't automatically generalize to every model, prompt, or workload. It doesn't mean quantization never costs you anything, only that it wasn't the culprit here. And it's one setup: we held the model weights, the tool schema, and the number of repeats constant, and changed the serving route (a provider and its precision move together, so read this as "the serve varied," not "precision never matters"). What it does establish is narrow and worth acting on: a benchmark score on a reference serve doesn't transfer cleanly to the endpoint you'll ship on. So test that endpoint, on your model and your prompt, and trust what you measure over what anyone published.

Test the endpoint, not the leaderboard

The thread running through all five checks is the same. A leaderboard scores one model, once, on one fixed reference serve that someone else controls. Your provider runs a different serve, with a different stack, different quantization, different uptime, and different load. The score doesn't transfer. Everything that actually determines whether your workload survives production lives in the serve, and the only way to see it is to run these checks on the endpoint you're about to commit to.

This is the provider-level companion to the model-level checklist. If you're still choosing the model, start with the five checks for picking an open-source model for function calling. That piece applies the same discipline to the model. Once the model passes, run these five checks on the provider's serve. Model first, then serve. Both have to pass.

What we run internally before we ship a serve

We didn't write this from theory. We run the same discipline on our own serves before we put them in front of anyone: a tool-calling probe, a liveness and availability check, a served-identity read, and a latency read at the tail. Every one of the failure modes above is one we hit ourselves and now catch before it ships. The checklist is the thing we wish every provider published and almost none do.

You can run the same loop against any provider you're evaluating, including Token Factory. Point the tool-call check at our endpoint, read the served identifier, hammer the availability, measure your prompt, and try the base-URL swap to see how cheap it'd be to leave. That's the whole pitch: don't trust our number either. Test the serve.

Want the five checks as a runnable eval template you can point at any endpoint? Request the template and we'll send it over.


FAQ

Where do you find the best AI inference platform?

By testing each candidate's endpoint on your own workload. Shortlist a few providers and run five checks on each: does the serve emit valid tool calls on your schema, what model and precision is actually served, is it reliably up, does it perform on your prompt at your real concurrency, and how hard is it to leave. The best platform is the one that passes on your workload, not the one topping a list.

How do I test an inference provider before going to production?

Run five checks on the provider's actual endpoint, not on its benchmark: measure the fraction of valid tool calls on your real schema over N repeats, log the served model identifier and precision, fire enough calls to read the true error rate, measure TTFT and end-to-end latency at p50 and p95 on your own prompt, and check how hard it is to leave (does a one-line base-URL swap work) and whether the provider can grow with you. The benchmark measured the model on one reference serve; your pipeline hits a different one.

Does the same model behave differently on different inference providers?

Yes. We watched the same Llama-3.3-70B return valid tool calls only ~37% and ~41% of the time on two providers' full-precision serves and 100% on a quantized serve. The serving route, not the model or the precision, was the variable. This is why you test the endpoint you'll actually use.

Is quantization the reason tool calling breaks?

No, and assuming so points you at the wrong problem. Across 2,100 correctness-graded tool calls, quantization down to fp4 showed zero degradation. In our tests the clean serve was the quantized one and the broken serves were full-precision. Read the served identifier so you know what ran, but don't blame the precision.

How do I avoid getting locked into an inference provider?

Test three things. The exit: point your OpenAI-compatible client at the provider with just a base URL and key change; if your code runs unchanged, leaving is a one-line change, and if you had to rewrite around a proprietary API, you're already half-trapped. The switching costs that build up: data-gravity and egress fees that make it costlier to pull your data and fine-tuned weights out the longer you stay, plus integration coupling to one model family or a proprietary SDK. The ceiling: whether the provider grows with you, from serverless to dedicated capacity to fine-tuning to self-hosting, or whether the next step is a full migration to another platform. A serverless endpoint that's cheap today can be the one you tear out in six months.

What does "test the endpoint, not the leaderboard" mean?

A public benchmark scores a model once, on a single fixed serve at a reference precision that the benchmark controls. Your provider runs its own serving stack, quantization, and uptime. The score doesn't transfer to their endpoint and says nothing about repeat reliability. The five checks run on the endpoint you're about to commit to.
