TL;DR: There's no single best open-source model for function calling. Tool-call reliability depends as much on how a model is served as on the model itself, and most teams only check the model. Before you wire one into an agent, run five checks on your own endpoint: does it emit a valid tool call, does it survive your real prompt, does it leave budget to answer, are you served the model you think, and what does it cost on your task. Then watch two more: speed and availability. A leaderboard rank is necessary. It is not enough.
Unfortunately, the answer isn't that simple. There's no single best open-source LLM for function calling. Reliability depends on how a model is served as much as on the model itself, so the best pick is whatever survives five checks on your actual workflow.
Here's why it matters, concretely. We ran roughly 50 served open models through a tool-calling harness. Llama 3.3 70B answered a 4-token health probe with "ok" in under two seconds, looked green, then buried our real 6,400-token prompt under about 2,500 tokens of a single ! character until it hit the token limit. The same Llama 3.3 70B, served at full precision by two different providers, botched the arguments on roughly 60% of its tool calls, while the fp8 serve of the same weights ran clean. None of that shows up on a model card or a leaderboard, because the card describes the weights and the leaderboard scores one fixed serve. Your pipeline runs a different serve.
We kept the five checks model-agnostic on purpose. They work on any OpenAI-compatible endpoint, not just ours.
Check 1: does it emit a valid tool call?
"Function calling: supported" is a flag, not a guarantee. Three things go wrong, and only the third looks like a bug.
Some models accept your tool definitions, return 200 OK, and emit no tool_call at all. They write prose describing the call instead ("I've used the get_weather tool to find..."), which your agent can't execute. Sometimes the cause isn't the model but the tool_choice setting. We watched a model return only prose with tool_choice: auto ("I've used the get_weather tool to find the current weather in Paris," with no call attached), then emit clean, valid get_weather calls every time once we set tool_choice: required. Some models are conservative and won't volunteer a call in auto mode, and that behavior can differ from one serve to the next. So before you write a model off as "can't call tools," try required. If it still won't emit, or it hands you the call as raw text instead of a structured tool_calls field, the problem is the serve, usually a missing tool-call parser. Either way you only learn it by testing the endpoint you'll actually run on.
Some emit the call but with malformed arguments. GLM-5.2, the latest flagship OSS model, looked broken in our first run for exactly this reason: its reasoning was fine, but it shaped the call wrong, with fields and values that didn't match the schema. We fixed it by tightening the prompt, not by swapping the model.
And "supported" is not "tested." DeepSeek-V3.2 was catalogued with an unknown tool-calling status. We smoke-tested it and it emitted clean calls. A label tells you nothing either way.
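A smoke test along these lines can be sketched in a few lines of Python. This is a minimal sketch, not our harness: it assumes the OpenAI-compatible chat-completions response shape (choices[0].message.tool_calls), and `send` is a hypothetical callable you supply that posts a payload to your endpoint and returns the parsed JSON body.

```python
from typing import Callable


def has_structured_call(response: dict) -> bool:
    """True when the first choice carries a non-empty tool_calls list."""
    choices = response.get("choices") or []
    if not choices:
        return False
    message = choices[0].get("message") or {}
    return bool(message.get("tool_calls"))


def first_mode_that_calls(send: Callable[[dict], dict], payload: dict) -> str:
    """Try tool_choice 'auto', then 'required'.

    Returns the first mode that produced a structured call,
    or 'none' if neither did.
    """
    for mode in ("auto", "required"):
        # Copy the payload so the caller's dict is never mutated.
        response = send({**payload, "tool_choice": mode})
        if has_structured_call(response):
            return mode
    return "none"
```

A result of "required" means the model can call tools but won't volunteer a call in auto mode on this serve. A result of "none" is the case worth escalating to the provider. Run it several times per candidate, since one pass says little about call-to-call reliability.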
A leaderboard doesn't save you here. The Berkeley Function Calling Leaderboard does score emission and argument correctness, but it scores each model once, on one serve Berkeley controls, at a reference precision. That tells you the model can call tools under those conditions. It says nothing about your provider's endpoint, where a different quantization, chat template, or tool-call parser changes the answer, and nothing about call-to-call reliability under load.
What to do: on your provider, with your tool schema, classify each candidate three ways. Emits a valid call, accepts but stays silent, or rejects. Build only on the first, and validate the arguments against your schema across a few repeats. Don't trust "unknown" as "incapable." Test it.
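Here is one way the three-way classification and the argument check might look. It is a sketch under assumptions: the response follows the OpenAI-compatible shape, each tool's parameters schema declares a single string type per property, and the checker covers only required fields, unknown fields, and primitive types. A full JSON Schema validator would be stricter. The function names are ours for illustration.

```python
import json
from typing import List

# Maps JSON Schema primitive type names to Python types.
JSON_TYPES = {
    "string": str, "number": (int, float), "integer": int,
    "boolean": bool, "array": list, "object": dict,
}


def classify(status_code: int, response: dict) -> str:
    """Bucket one response as 'emits', 'accepts-only', or 'rejects'."""
    if status_code >= 400:
        return "rejects"
    choices = response.get("choices") or []
    message = (choices[0].get("message") or {}) if choices else {}
    return "emits" if message.get("tool_calls") else "accepts-only"


def argument_errors(tool_call: dict, parameters: dict) -> List[str]:
    """Return a list of problems with a tool call's arguments (empty if none)."""
    try:
        args = json.loads(tool_call["function"]["arguments"])
    except (KeyError, TypeError, ValueError) as exc:
        return [f"arguments are not valid JSON: {exc}"]
    if not isinstance(args, dict):
        return ["arguments are not a JSON object"]

    properties = parameters.get("properties", {})
    errors = [f"missing required field: {name}"
              for name in parameters.get("required", []) if name not in args]
    for name, value in args.items():
        if name not in properties:
            errors.append(f"unexpected field: {name}")
            continue
        declared = properties[name].get("type")
        expected = JSON_TYPES.get(declared)
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        wrong_bool = isinstance(value, bool) and declared != "boolean"
        if expected and (wrong_bool or not isinstance(value, expected)):
            errors.append(f"field {name} should be {declared}")
    return errors
```

Run `classify` on every repeat and `argument_errors` on every emitted call, then keep only candidates that land in "emits" with an empty error list across the whole batch.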
Check 2: does it survive your real prompt, not a health probe?
A 4-token probe is a liveness check a broken model will pass. That Llama 3.3 70B case from the intro is the whole point: "ok" in under two seconds on the probe, then ~2,500 tokens of ! on the real prompt, reproduced across runs. Mistral-Nemo on the exact same prompt returned a clean 483-word answer. The big model wasn't slow or rate-limited. It produced garbage that only appeared at real input length.
What to do: probe with input that looks like your real prompt, including length and tool schema. Add a cheap guard for degenerate output: long response, almost no unique characters, is not an answer.
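The degenerate-output guard can be very small. The sketch below flags a response that is long but built from only a handful of distinct characters. The function name and both thresholds are illustrative assumptions to tune on your own traffic, and the check will not catch subtler loops such as a repeated sentence.

```python
def looks_degenerate(text: str, min_length: int = 200, max_unique_chars: int = 5) -> bool:
    """Flag long responses made of almost no distinct characters."""
    stripped = text.strip()
    if len(stripped) < min_length:
        # Short answers such as "ok" are not judged by this guard.
        return False
    return len(set(stripped)) <= max_unique_chars
```

A run of 2,500 exclamation marks trips the guard, while ordinary prose of the same length uses dozens of distinct characters and passes.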
Check 3: does it leave budget to answer?
Reasoning models can spend the whole completion budget thinking and hand back empty content. GPT-OSS-120B did exactly that on a short tool-calling and synthesis task: hit the token limit, returned nothing visible, billed for the call. Other models cleared the same task in a few dozen tokens.
There's a fix, and it has a cost you have to measure. Give the model more room (raise max_tokens), and demand the must-have structure first so the answer lands before the thinking runs long. Across seven reasoning models, four flipped from partial to 100% format conformance with that change: MiniMax-M3 (20% to 100%), MiniMax-M2.7 (60% to 100%), Qwen3.6-35B (80% to 100%), Kimi-K2.6 (0% to 100%). But conformance isn't the finish line. Kimi-K2.6 only got there by running about 8× slower, roughly 204 seconds per run, which is not a production agent. GLM-5.2 got faster and cheaper with the same change. GPT-OSS-20B couldn't be fixed at all and stayed dead.
One caveat: forcing structure before reasoning can hurt on genuinely hard problems, where the model benefits from thinking before it commits. So conform first, then re-measure latency and cost, and check the answer didn't get worse. "Conforms" and "conforms within budget, at quality" are two different gates.
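To catch the empty-but-billed case automatically, you can inspect finish_reason next to the visible output. A minimal sketch, assuming the OpenAI-compatible response shape, where finish_reason is "length" when the completion hit max_tokens. The verdict labels are our own.

```python
def budget_verdict(response: dict) -> str:
    """Label one response: 'ok', 'truncated', 'exhausted', or 'empty'."""
    choice = response["choices"][0]
    message = choice.get("message") or {}
    has_text = bool((message.get("content") or "").strip())
    has_output = has_text or bool(message.get("tool_calls"))
    hit_limit = choice.get("finish_reason") == "length"

    if hit_limit and not has_output:
        return "exhausted"   # budget spent, nothing visible came back
    if hit_limit:
        return "truncated"   # some output, but cut off at the limit
    return "ok" if has_output else "empty"
```

An "exhausted" verdict is the signal to raise max_tokens or move the required structure earlier in the prompt, then re-measure latency, cost, and answer quality as described above.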
Check 4: are you served the model you think? (the serve is the variable, not the precision)
The instinct when a model misbehaves is to blame quantization. We tested that head-on and the instinct is wrong.
Across 2,100 correctness-graded tool calls, full precision versus quantized, down to fp4, the quantized arms matched the full-precision arms. Qwen3-235B, Mistral-Small-24B, and DeepSeek-V4-Pro all scored ~100% on correct tool and correct values, fp4 included. So in our tested arms, the lower-precision serves were not the weak link.
What broke it was the serve. The same Llama 3.3 70B, served at full precision by two different providers, returned malformed tool-call arguments on roughly 60% of calls, reproduced on both. The fp8 serve of the same weights was flawless. Same model, higher precision, worse output. The variable was the serving stack, not the number of bits. (Honest caveat: each arm was a different provider, so precision and provider are tangled. The narrow claim is the safe one: the lower-precision arms were not worse, and the serve is what varied.)
What to do: read the served-model identifier in the response so you know what you actually ran, then test tool-call validity on that exact endpoint. The precision label doesn't predict reliability. The serve does.
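Reading the identifier is a one-line comparison. The sketch assumes the response carries a top-level model field, as OpenAI-compatible chat completions do. Whether a provider encodes precision or build details in that string varies, so treat a mismatch as a prompt to investigate rather than proof of anything.

```python
from typing import Optional


def served_model_note(requested: str, response: dict) -> Optional[str]:
    """Return a human-readable note when the served identifier differs
    from the requested one, or None when they match."""
    served = response.get("model")
    if served is None:
        return "response carries no model field"
    if served != requested:
        return f"requested {requested!r}, served {served!r}"
    return None
```

Log the note alongside every test result so each tool-call validity score is tied to the identifier that actually produced it.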
Check 5: what does it cost on your task, not the headline?
Per-token rates mislead, because the bill is per task and models spend wildly different token counts to reach the same answer. Here's a real one. We ran a write-capable CRM hygiene skill (search, dedupe, create contact and lead, summarize last interactions, and classify ICP and persona, which requires judgment) across served models and graded safety, cost, and speed on the same task:
| Model | Type | Safety | $/task | Speed | Verdict |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | frontier | 100% | $0.015 | 8.4s | the bar |
| DeepSeek-V3.2 | open | 100% | $0.0005 | 4.6s | frontier-grade, ~30× cheaper |
| gemma-4-31b | open | 98% | $0.0003 | 4.3s | cheapest that held up, ~50× cheaper |
| Llama-3.3-70B | open | 95% | $0.0002 | 3.3s | cheap, needs review |
| GLM-5.2 | open | 88% | $0.0073 | 12.5s | not worth it here |
| Kimi-K2.6 | open | 26% | $0.011 | 46s | 80% errored, unusable |
Read the bottom two rows. Kimi-K2.6 was the most expensive open model on the board and the worst, because it errored on most calls and burned tokens doing it. The headline rate told you none of that. DeepSeek-V3.2 hit frontier safety at a thirtieth of the cost. Same task, same grading.
What to do: run your representative prompt, read the actual completion_tokens, multiply by the rate, and compare cost-per-successful-task, not price-per-token. A cheap rate card on a model that fails half its calls is the expensive option.
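The arithmetic is simple enough to script. In this sketch each run is a dict holding the token counts from the response's usage block plus your own pass/fail grade, and rates are in dollars per million tokens. It also counts prompt tokens, which is our assumption about how your provider bills. The field names and the rates in the example are illustrative.

```python
from typing import Iterable


def cost_per_successful_task(runs: Iterable[dict],
                             input_rate_per_m: float,
                             output_rate_per_m: float) -> float:
    """Total spend across all runs divided by the number of successful runs.

    Failed runs still cost money, so they stay in the numerator.
    """
    total_cost = 0.0
    successes = 0
    for run in runs:
        total_cost += (run["prompt_tokens"] * input_rate_per_m
                       + run["completion_tokens"] * output_rate_per_m) / 1_000_000
        if run["success"]:
            successes += 1
    if successes == 0:
        return float("inf")  # nothing succeeded: no price makes this cheap
    return total_cost / successes
```

For example, two runs at 1,000 prompt tokens and 500 completion tokens each, priced at $1 and $2 per million tokens, cost $0.002 apiece. If only one succeeds, the cost per successful task is $0.004, double the per-call figure.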
Two more to watch before you commit: speed and availability
The five checks are about correctness. Two more decide whether a correct model is usable, and like Check 4, both live in the serve.
Performance: measure latency, and watch the tail. Record time-to-first-token and tokens per second on your endpoint with your real prompt, then look at the variance, not the mean. A model that's fast on average but spikes on tail requests still breaks a user-facing agent, because users feel the p95. There's no headline number to trust here, only the one you measure on the serve you'll run on.
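Once you have recorded per-request latencies, the tail is easy to compute. This sketch uses the nearest-rank percentile definition. It assumes you have already collected the samples (for example, time-to-first-token in seconds) on your own endpoint, and the helper names are ours.

```python
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


def latency_summary(samples: Sequence[float]) -> Dict[str, float]:
    """Mean next to p50 and p95, so a spiky tail is visible at a glance."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
    }
```

With eighteen requests at 1 second and two at 30 seconds, the mean is 3.9 seconds and the median is 1 second, but the p95 is 30 seconds. That gap is what a user-facing agent feels.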
Availability: "listed" is not "served." A model can be in the catalog and still be down, disabled, or flaky. Run a batch of calls and read the error rate. In our testing, Qwen3-235B was disabled and 503'd every one of 30 calls while still appearing in a stale model list, and Kimi-K2.6 errored on roughly 80% of requests. Neither is visible from a model card or a single smoke test. Watch for flapping too: a serve that's healthy now and degraded an hour later is more dangerous than one cleanly down, because your monitoring keeps trusting it.
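A batch error rate and a crude flapping check might look like the sketch below. It works on HTTP status codes you have already collected. The window thresholds are illustrative assumptions, not values from our testing.

```python
from typing import Sequence


def error_rate(status_codes: Sequence[int]) -> float:
    """Share of calls in a batch that returned a 4xx or 5xx status."""
    if not status_codes:
        raise ValueError("no calls recorded")
    failures = sum(1 for code in status_codes if code >= 400)
    return failures / len(status_codes)


def is_flapping(window_error_rates: Sequence[float],
                healthy_below: float = 0.05,
                degraded_above: float = 0.5) -> bool:
    """True when the same serve looked healthy in one time window and
    degraded in another (for example, hourly batches)."""
    saw_healthy = any(rate < healthy_below for rate in window_error_rates)
    saw_degraded = any(rate > degraded_above for rate in window_error_rates)
    return saw_healthy and saw_degraded
```

Thirty calls that all return 503 give an error rate of 1.0, which is cleanly down. A sequence of windows such as 0.0, 0.8, 0.0 is the flapping case, and it deserves an alert of its own.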
The prompt is a bigger lever than the model
One rule runs through all of this: a bare probe wrongly rejects capable models. In the mid-tier especially, the prompt moves more than the model. gemma-4-31b went from 74% to 98% safety from one change, making the decision fields required in the schema. DeepSeek-V3.2 hit 100% the moment the prompt carried an explicit output template. If you evaluate with a thin prompt, you'll disqualify models that would have worked and mis-rank the ones that pass. Probe with the output contract you'd actually ship.
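Making the decision fields required is a small schema edit. The sketch below does it on an OpenAI-style tool definition. The tool name and the field names in the usage example are hypothetical stand-ins, since the actual schema from our CRM skill isn't shown here.

```python
import copy
from typing import List


def with_required(tool: dict, fields: List[str]) -> dict:
    """Return a copy of a tool definition with the given properties
    added to its parameters' 'required' list."""
    updated = copy.deepcopy(tool)
    params = updated["function"]["parameters"]
    unknown = [f for f in fields if f not in params.get("properties", {})]
    if unknown:
        raise KeyError(f"not declared in properties: {unknown}")
    existing = params.get("required", [])
    params["required"] = existing + [f for f in fields if f not in existing]
    return updated


# Hypothetical tool definition, for illustration only.
example_tool = {
    "type": "function",
    "function": {
        "name": "record_decision",
        "description": "Record the dedupe decision for a contact.",
        "parameters": {
            "type": "object",
            "properties": {
                "contact_id": {"type": "string"},
                "action": {"type": "string", "enum": ["create", "merge", "skip"]},
                "reason": {"type": "string"},
            },
            "required": ["contact_id"],
        },
    },
}

strict_tool = with_required(example_tool, ["action", "reason"])
```

With `action` and `reason` required, a call that omits its decision fails schema validation and can be retried, instead of passing through half-specified.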
What we run internally
This isn't theory for us. We test the models we actually use with a tool-calling harness, on a schedule, not once. We grade them on real skills and agents, not benchmarks, and we still haven't found an open model that's best at everything. The leaderboard isn't our production stack. And we tune the prompt per model, because each one behaves differently and asks for different things: one won't call a tool until you set tool_choice: required, another needs the structure demanded first, another just needs a worked example.
It shows up in production, not just on a scorecard. On our own Token Factory, an open model ran a write-capable CRM skill end to end against live HubSpot: it deduped against an existing company instead of creating a duplicate, filed real records, zero corruption, at roughly a thirtieth of the frontier cost.
That's the point of the five checks. The model that wins is the one that survives them on your task, your prompt, and your serve. Not the one at the top of a leaderboard.
FAQ
What's the best open-source LLM for function calling?
There's no single answer that survives production, because tool-call reliability depends on the serve as much as the model. The right pick is the one that, on your provider and your prompt, emits valid tool calls, survives your real prompt length, leaves budget to answer, is actually the serve you think you're running, and costs the least per successful task. Run the five checks above against your shortlist instead of trusting a rank.
An open model accepts my tools but never emits a tool call. What do I do?
You can't fix the serving layer if you don't run it, so this is a routing decision. If you get a 400 asking you to enable an auto tool-call parser, the provider hasn't turned it on for that model. If you get a 200 with prose instead of a structured call, the request was accepted but no call was emitted. First flip tool_choice to required: some models won't volunteer a call in auto mode and only call when forced. If it still returns prose, or returns the call as raw text instead of a tool_calls field, that's usually a serving-layer gap (a missing tool-call parser). Either way: switch model or provider, or report it. Classify candidates as emits / accepts-only / rejects before you build on them.
How do I know if my provider quantized my model, and does it matter?
Read the served-model identifier in the response. Providers commonly serve an FP8 build of the model you named. But our testing (2,100 graded calls, down to fp4) showed precision wasn't the thing that mattered: quantized serves matched full precision, while two full-precision serves of the same model failed ~60% of their calls. Test tool-call validity on your actual endpoint rather than trusting, or distrusting, a precision label.
Why does a reasoning model return empty content but still bill me?
It spent the whole completion budget reasoning and hit the token limit before emitting a visible answer. Give it more room and demand the structured output first, then re-measure, because some models only conform by getting much slower and pricier (Kimi-K2.6 hit ~204 seconds a run), which is its own failure.