How to Benchmark Open-Source Models Before You Commit | FlexAI Blog
Model · AI · Benchmark · Open Source · Model Evaluation

    How to Benchmark Open-Source Models Before You Commit

    April 27, 2026

    You're choosing between Llama 4 Scout 17B, GPT-OSS 120B, and DeepSeek V3.2. The paper numbers look fine across all three. You pick the one that feels right and ship it.

    Three weeks later it fails on the exact task category your users care about — and you find out from support tickets, not your eval pipeline.

This is the standard failure mode. It's not that teams skip evaluation. It's that they rely on benchmark numbers someone else ran, on tasks that don't reflect their workload. A model that leads on MMLU doesn't necessarily perform well on your summarization pipeline. A reasoning model like Qwen 3 30B A3B Thinking may be overkill for structured extraction.

    lm-evaluation-harness fixes that. Here's how to use it before you commit.

    Picking the right tasks

    Most teams make the same mistake: they run MMLU because everyone runs MMLU, then wonder why the scores don't predict production behavior.

    Task selection should follow use case, not convention.

    Reasoning and knowledge

If your workload involves factual recall or multi-step reasoning — support automation, research summarization, QA over documents — general-purpose models like Llama 4 Scout 17B, GPT-OSS 120B, DeepSeek V3.2, and Gemma 4 31B are the candidates here.

    • mmlu — 57-subject knowledge coverage

    • arc_challenge — scientific reasoning, harder than arc_easy

    • hellaswag — commonsense inference

    Math and structured reasoning

    If you're evaluating dedicated reasoning models — Qwen 3 30B A3B Thinking or DeepSeek R1 Distill Llama 8B — these tasks are the ones that differentiate them from general-purpose models.

    • gsm8k — grade-school math, strong signal for multi-step numeric reasoning

    • mathqa — more complex applied math

    • math — competition math, where reasoning models pull ahead

    Instruction following

For chat, agents, or structured output — a key differentiator between models like Llama 4 Scout and GPT-OSS 120B on instruction-heavy workloads.

    • ifeval — instruction-following eval, strict and useful

    • mt_bench — multi-turn, closer to real chat scenarios

    Code generation

    If evaluating Qwen 3 Coder 30B or Qwen 3 Coder 480B A35B against a general-purpose alternative like Llama 4 Scout, these tasks show where the specialist model earns its keep.

    • humaneval — function-level Python generation

    • mbpp — broader Python programming tasks

    Logical reasoning

    Argument evaluation, legal reasoning, or structured decisions:

    • logiqa — logic-based reading comprehension

    • anli — adversarial NLI, harder than standard NLI

    The signal you want: a model that scores well on tasks similar to your workload and acceptably on everything else. A model that leads on MMLU but drops 20 points on IFEval tells you something important.

    Running evals on FlexAI

    FlexAI's lm-evaluation-harness blueprint handles the environment — no Dockerfile, no pip install, no CUDA configuration. You bring the model, the task list, and the GPU count. It runs.

    The core command:

    flexai training run lm-eval-llama4-scout \
      --accels 4 \
      --nodes 1 \
      --repository-url https://github.com/flexaihq/blueprints \
      --secret HF_TOKEN=<your-hf-secret> \
      --requirements-path code/lm-evaluation-harness/requirements.txt \
      --runtime nvidia-25.03 \
      -- lm_eval \
          --model hf \
          --model_args pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct \
          --tasks mmlu,gsm8k,ifeval,arc_challenge \
          --device cuda \
          --batch_size 8 \
          --output_path /output-checkpoint/llama4_scout_eval.json
    

    Monitor with flexai training logs <job>, pull results with flexai checkpoint fetch <id>. Results come back as structured JSON you can compare across runs.
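The metric key names in that JSON vary across harness versions (`acc` vs `acc,none`), so a small helper that flattens the results file defensively is worth having. A sketch, assuming the usual top-level `results` mapping of task to metrics; the sample scores below are illustrative, not real output:

```python
def summarize(results_json: dict) -> dict:
    """Flatten lm_eval's {"results": {task: {metric: value}}} layout
    into {task: primary_score}. Metric key names vary by harness
    version ("acc" vs "acc,none"), so match on the prefix."""
    summary = {}
    for task, metrics in results_json.get("results", {}).items():
        for key, value in metrics.items():
            if key.split(",")[0] in ("acc", "exact_match") and "stderr" not in key:
                summary[task] = value
                break
    return summary

# Illustrative scores, not real eval output.
sample = {"results": {"mmlu": {"acc,none": 0.72, "acc_stderr,none": 0.004},
                      "gsm8k": {"exact_match,none": 0.81}}}
print(summarize(sample))  # {'mmlu': 0.72, 'gsm8k': 0.81}
```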

    Full walkthrough: docs.flex.ai/blueprints/lm-evaluation-harness

GPU sizing: for models under 7B, two GPUs cover a standard task suite. For 7B–70B models, expect 4–8 GPUs and 2–12 hours depending on task count and shot configuration. You pay per GPU-hour — no reserved capacity required.

    Non-Hugging Face checkpoints: push your checkpoint to FlexAI's Checkpoint Manager first (flexai checkpoint push), then reference it with --model_args pretrained=/input-checkpoint/<your-checkpoint> in place of the HF model path.

    One practical note: run the same task suite across your candidate shortlist in parallel jobs before you pick. The relative gaps matter more than absolute scores.
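With one results file per candidate, a few lines turn the shortlist into per-task rankings where those relative gaps are visible at a glance. A sketch with hypothetical scores; the model names and numbers are placeholders:

```python
def rank_by_task(scores):
    """For each task, sort candidate models by score, best first,
    so relative gaps across the shortlist stand out."""
    tasks = {t for per_task in scores.values() for t in per_task}
    return {t: sorted(((m, s[t]) for m, s in scores.items() if t in s),
                      key=lambda pair: -pair[1])
            for t in sorted(tasks)}

# Hypothetical numbers for two candidates (in points).
shortlist = {
    "llama4-scout": {"mmlu": 72.0, "ifeval": 78.0},
    "gpt-oss-120b": {"mmlu": 74.0, "ifeval": 70.0},
}
for task, ranking in rank_by_task(shortlist).items():
    (best, top), (_, bottom) = ranking[0], ranking[-1]
    print(f"{task}: {best} leads by {top - bottom:.1f} pts")
```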

    Reading the results

    Aggregate scores hide category variance. A model at 72% on MMLU might be at 85% on STEM and 58% on social sciences. If your workload is technical, the aggregate undersells it. Always look at subtask breakdowns.
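MMLU's per-subject scores make that check mechanical. A sketch that averages subtask accuracies within categories you define; the subtask names, scores, and grouping below are hypothetical:

```python
def category_means(subtask_scores, categories):
    """Average per-subtask scores within each category to expose
    the variance a single aggregate number hides."""
    return {cat: sum(subtask_scores[t] for t in tasks) / len(tasks)
            for cat, tasks in categories.items()}

# Hypothetical subtask scores (in points) and a two-category grouping.
scores = {"mmlu_physics": 86.0, "mmlu_chemistry": 84.0, "mmlu_sociology": 58.0}
groups = {"stem": ["mmlu_physics", "mmlu_chemistry"],
          "social": ["mmlu_sociology"]}
print(category_means(scores, groups))  # {'stem': 85.0, 'social': 58.0}
```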

    0-shot vs 5-shot delta reveals prompt sensitivity. A model that jumps 15 points from 0-shot to 5-shot on GSM8K is heavily dependent on in-context examples. That's fine if your production setup includes few-shot prompting — a liability if it doesn't.
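The delta is cheap to compute from two runs of the same suite, one with `--num_fewshot 0` and one with `--num_fewshot 5`. A sketch with hypothetical accuracies, flagging tasks above a sensitivity threshold you choose:

```python
def flag_prompt_sensitive(zero, five, threshold=10.0):
    """Tasks where the 5-shot score exceeds 0-shot by more than
    `threshold` points: candidates for mandatory few-shot prompting."""
    return [t for t in zero if t in five and five[t] - zero[t] > threshold]

# Hypothetical accuracies (points) from paired 0-shot / 5-shot runs.
zero = {"gsm8k": 62.0, "ifeval": 75.0}
five = {"gsm8k": 77.0, "ifeval": 76.5}
print(flag_prompt_sensitive(zero, five))  # ['gsm8k']
```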

Score gaps under 1–2 points are noise. A 5-point gap on a well-designed task (IFEval, GSM8K) is meaningful. A 0.5-point gap on MMLU is not. Don't let marginal differences drive model selection.
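The harness reports a standard error alongside each score, which gives you a quick way to separate signal from noise. A rough two-sigma check, using illustrative numbers:

```python
from math import sqrt

def gap_is_signal(score_a, se_a, score_b, se_b):
    """Rough two-sigma check: treat the gap as meaningful only if it
    exceeds twice the combined standard error of the two scores."""
    return abs(score_a - score_b) > 2 * sqrt(se_a**2 + se_b**2)

# Hypothetical scores with the stderr values the harness reports.
print(gap_is_signal(0.720, 0.004, 0.725, 0.004))  # False: 0.5 pts is noise
print(gap_is_signal(0.72, 0.012, 0.78, 0.012))    # True: ~6 pts clears the bar
```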

    Latency and throughput aren't in the eval. A model that scores 5% higher but runs at half the tokens/second at your volume changes the cost equation entirely. Benchmark for capability first, then size for production load separately.
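The back-of-envelope cost math is worth writing down before you decide. A sketch; the throughput figures and per-GPU hourly rate below are hypothetical, not FlexAI pricing:

```python
def cost_per_million_tokens(tokens_per_second, gpus, gpu_hourly_rate):
    """Serving cost per 1M output tokens, given measured throughput
    and a per-GPU hourly price (hypothetical rate)."""
    tokens_per_hour = tokens_per_second * 3600
    return gpus * gpu_hourly_rate / tokens_per_hour * 1_000_000

# Hypothetical: model A scores higher but runs at half the throughput.
a = cost_per_million_tokens(tokens_per_second=400, gpus=4, gpu_hourly_rate=2.5)
b = cost_per_million_tokens(tokens_per_second=800, gpus=4, gpu_hourly_rate=2.5)
print(f"A: {a:.2f}/M tokens, B: {b:.2f}/M tokens")  # A costs twice as much per token
```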

    Standard benchmarks don't test your specific workload. The most reliable complement is a golden eval set built from your own production data — real inputs with known correct outputs. lm-evaluation-harness supports custom tasks. Run your golden set alongside standard benchmarks for a grounded comparison.
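The harness's custom-task machinery is the full-featured route, but even a standalone exact-match scorer over golden pairs catches regressions the standard suites miss. A minimal sketch; the predictor and pairs below are stand-ins for your model and your production data:

```python
def golden_set_accuracy(predict, golden):
    """Exact-match accuracy of `predict` (any callable wrapping your
    model) over (input, expected_output) pairs from production data."""
    hits = sum(predict(x).strip() == y.strip() for x, y in golden)
    return hits / len(golden)

# Hypothetical golden pairs and a stub predictor standing in for the model.
golden = [("2+2=", "4"), ("capital of France?", "Paris")]
stub = {"2+2=": "4", "capital of France?": "Paris "}.get
print(golden_set_accuracy(stub, golden))  # 1.0
```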

    From eval to deployment

    Once you've picked your model, the path to production runs on the same infrastructure you used to evaluate it.

Token Factory — launching May 9 — gives you per-token serverless inference on the same model catalog you can benchmark today: Llama 4 Scout, DeepSeek V3.2, Qwen 3 (0.6B to 30B), Gemma 4, and others. No cluster management, no reserved capacity.

    Evaluate on FlexAI GPU compute. Deploy on Token Factory. Same platform, same model weights — no environment gap between what you tested and what you shipped.

    Get Started Today

    Start building with €100 in free credits for first-time users.