You're choosing between Llama 4 Scout 17B, GPT-OSS 120B, and DeepSeek V3.2. The paper numbers look fine across all three. You pick the one that feels right and ship it.
Three weeks later it fails on the exact task category your users care about — and you find out from support tickets, not your eval pipeline.
This is the standard failure mode. It's not that teams skip evaluation. It's that they rely on benchmark numbers someone else ran, on tasks that don't reflect their workload. A model that leads on MMLU doesn't necessarily perform well on your summarization pipeline. A reasoning model like Qwen 3 30B A3B Thinking may be overkill for structured extraction.
lm-evaluation-harness fixes that. Here's how to use it before you commit.
Picking the right tasks
Most teams make the same mistake: they run MMLU because everyone runs MMLU, then wonder why the scores don't predict production behavior.
Task selection should follow use case, not convention.
Reasoning and knowledge
Use these when the workload is factual recall or multi-step reasoning: support automation, research summarization, QA over documents. General-purpose models like Llama 4 Scout 17B, GPT-OSS 120B, DeepSeek V3.2, and Gemma 4 31B are the candidates here.
mmlu — 57-subject knowledge coverage
arc_challenge — scientific reasoning, harder than arc_easy
hellaswag — commonsense inference
Math and structured reasoning
If you're evaluating dedicated reasoning models — Qwen 3 30B A3B Thinking or DeepSeek R1 Distill Llama 8B — these tasks are the ones that differentiate them from general-purpose models.
gsm8k — grade-school math, strong signal for multi-step numeric reasoning
mathqa — more complex applied math
math — competition math, where reasoning models pull ahead
Instruction following
For chat, agents, or structured output. Instruction following is a key differentiator between models like Llama 4 Scout and GPT-OSS 120B on instruction-heavy workloads.
ifeval — instruction-following eval, strict and useful
mt_bench — multi-turn, closer to real chat scenarios
Code generation
If evaluating Qwen 3 Coder 30B or Qwen 3 Coder 480B A35B against a general-purpose alternative like Llama 4 Scout, these tasks show where the specialist model earns its keep.
humaneval — function-level Python generation
mbpp — broader Python programming tasks
Logical reasoning
For argument evaluation, legal reasoning, or structured decisions:
logiqa — logic-based reading comprehension
anli — adversarial NLI, harder than standard NLI
The signal you want: a model that scores well on tasks similar to your workload and acceptably on everything else. A model that leads on MMLU but drops 20 points on IFEval tells you something important.
Running evals on FlexAI
FlexAI's lm-evaluation-harness blueprint handles the environment — no Dockerfile, no pip install, no CUDA configuration. You bring the model, the task list, and the GPU count. It runs.
The core command:
flexai training run lm-eval-llama4-scout \
--accels 4 \
--nodes 1 \
--repository-url https://github.com/flexaihq/blueprints \
--secret HF_TOKEN=<your-hf-secret> \
--requirements-path code/lm-evaluation-harness/requirements.txt \
--runtime nvidia-25.03 \
-- lm_eval \
--model hf \
--model_args pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tasks mmlu,gsm8k,ifeval,arc_challenge \
--device cuda \
--batch_size 8 \
--output_path /output-checkpoint/llama4_scout_eval.json
Monitor with flexai training logs <job> and pull results with flexai checkpoint fetch <id>. Results come back as structured JSON you can compare across runs.
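For a quick side-by-side before digging into the full files, here is a minimal sketch using jq. It assumes the harness's default output layout, where per-task metrics sit under a top-level "results" key; the file names are placeholders for whatever you fetched from each run.

for f in llama4_scout_eval.json gpt_oss_120b_eval.json; do
  echo "== ${f}"
  # Print each task with its accuracy-style metric, falling back to exact match
  # for tasks (like gsm8k) that report a different headline metric.
  jq -r '.results | to_entries[]
         | "\(.key): \(.value["acc,none"] // .value["exact_match,strict-match"] // "see full entry")"' "${f}"
done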
Full walkthrough: docs.flex.ai/blueprints/lm-evaluation-harness
GPU sizing: for models under 7B, 2 GPUs cover a standard task suite. For 7B–70B models, expect 4–8 GPUs and 2–12 hours depending on task count and shot configuration. You pay per GPU-hour — no reserved capacity required.
Non-Hugging Face checkpoints: push your checkpoint to FlexAI's Checkpoint Manager first (flexai checkpoint push), then reference it with --model_args pretrained=/input-checkpoint/<your-checkpoint> in place of the HF model path.
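Concretely, a sketch of that flow; the checkpoint name and output path are placeholders, the exact arguments to flexai checkpoint push may differ on your CLI version, and the remaining flags mirror the core command above.

# Upload the local checkpoint to the Checkpoint Manager
# (check flexai checkpoint push --help for the exact argument syntax).
flexai checkpoint push my-finetuned-model
flexai training run lm-eval-my-finetune \
  --accels 4 \
  --nodes 1 \
  --repository-url https://github.com/flexaihq/blueprints \
  --secret HF_TOKEN=<your-hf-secret> \
  --requirements-path code/lm-evaluation-harness/requirements.txt \
  --runtime nvidia-25.03 \
  -- lm_eval \
  --model hf \
  --model_args pretrained=/input-checkpoint/my-finetuned-model \
  --tasks mmlu,gsm8k,ifeval,arc_challenge \
  --device cuda \
  --batch_size 8 \
  --output_path /output-checkpoint/my_finetune_eval.json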
One practical note: run the same task suite across your candidate shortlist in parallel jobs before you pick. The relative gaps matter more than absolute scores.
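One way to script that, as a sketch: the Hugging Face repo IDs for GPT-OSS and DeepSeek below are placeholders to swap for the exact checkpoints on your shortlist, and the flags mirror the core command above.

TASKS="mmlu,gsm8k,ifeval,arc_challenge"
for MODEL in \
  meta-llama/Llama-4-Scout-17B-16E-Instruct \
  openai/gpt-oss-120b \
  deepseek-ai/DeepSeek-V3.2; do
  # Derive a job name and output file from the model ID.
  SLUG=$(basename "${MODEL}" | tr '[:upper:]' '[:lower:]')
  flexai training run "lm-eval-${SLUG}" \
    --accels 4 --nodes 1 \
    --repository-url https://github.com/flexaihq/blueprints \
    --secret HF_TOKEN=<your-hf-secret> \
    --requirements-path code/lm-evaluation-harness/requirements.txt \
    --runtime nvidia-25.03 \
    -- lm_eval --model hf \
    --model_args pretrained="${MODEL}" \
    --tasks "${TASKS}" \
    --device cuda --batch_size 8 \
    --output_path "/output-checkpoint/${SLUG}_eval.json"
done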
Reading the results
Aggregate scores hide category variance. A model at 72% on MMLU might be at 85% on STEM and 58% on social sciences. If your workload is technical, the aggregate undersells it. Always look at subtask breakdowns.
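A quick way to surface that breakdown from a results file, again assuming the default output layout where MMLU subtasks are reported as separate mmlu_<subject> entries under the "results" key; the file name is a placeholder.

# List per-subject MMLU scores, sorted from weakest to strongest subject.
jq -r '.results | to_entries[]
       | select(.key | startswith("mmlu_"))
       | "\(.key): \(.value["acc,none"])"' llama4_scout_eval.json | sort -t: -k2 -n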
0-shot vs 5-shot delta reveals prompt sensitivity. A model that jumps 15 points from 0-shot to 5-shot on GSM8K is heavily dependent on in-context examples. That's fine if your production setup includes few-shot prompting — a liability if it doesn't.
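To measure that delta directly, run the task twice with different shot counts. Only the lm_eval portion of the job command changes; --num_fewshot is the relevant harness flag, and the output paths below are examples.

for SHOTS in 0 5; do
  lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tasks gsm8k \
    --num_fewshot "${SHOTS}" \
    --device cuda \
    --batch_size 8 \
    --output_path "/output-checkpoint/gsm8k_${SHOTS}shot.json"
done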
Score gaps under 1-2 points are noise. A 5-point gap on a well-designed task (IFEval, GSM8K) is meaningful. A 0.5-point gap on MMLU is not. Don't let marginal differences drive model selection.
Latency and throughput aren't in the eval. A model that scores 5% higher but runs at half the tokens/second at your volume changes the cost equation entirely. Benchmark for capability first, then size for production load separately.
Standard benchmarks don't test your specific workload. The most reliable complement is a golden eval set built from your own production data — real inputs with known correct outputs. lm-evaluation-harness supports custom tasks. Run your golden set alongside standard benchmarks for a grounded comparison.
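At the command level, that might look like the sketch below, assuming you've written a custom task definition for the harness (a YAML config plus your dataset) in a local my_tasks/ directory and named it golden_set; both names are placeholders, and --include_path is the harness flag for loading external task definitions.

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tasks golden_set,mmlu,ifeval \
  --include_path ./my_tasks \
  --device cuda \
  --batch_size 8 \
  --output_path /output-checkpoint/golden_vs_standard.json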
From eval to deployment
Once you've picked your model, the path to production runs on the same infrastructure you used to evaluate it.
Token Factory — launching May 9 — gives you per-token serverless inference on the same model catalog you can benchmark today: Llama 4 Scout, DeepSeek V3.2, Qwen 3 (0.6B to 30B), Gemma 4, and others. No cluster management, no reserved capacity.
Evaluate on FlexAI GPU compute. Deploy on Token Factory. Same platform, same model weights — no environment gap between what you tested and what you shipped.