
    How to Use EasyR1 for Reinforcement Learning on FlexAI

May 11, 2026 · 13 min read

    EasyR1 is a reinforcement learning fine-tuning framework that supports GRPO, DAPO, and REINFORCE for reasoning-focused post-training. Use it when SFT starts plateauing on tasks like math, code, or logic and you can define a reliable reward signal. On FlexAI, you get a published reference workflow, managed checkpoints, and a direct path from training to serving.

    Your supervised fine-tuned model can follow instructions and mimic patterns, but on tasks that require multi-step reasoning — math problems, code debugging, logical deduction — it often hits a ceiling. SFT learns to imitate correct outputs; it doesn't learn the underlying problem-solving process. Reinforcement learning can push past that ceiling by training the model to maximize a reward signal rather than match a label. For instruction following, style adaptation, or domain transfer, SFT is usually the right call. RL makes sense when SFT accuracy has flatlined and the task has a clear, evaluable correct answer. This post explains when and how to use EasyR1, which algorithm to pick, what the FlexAI reference setup looks like, what it costs, and how to know whether it actually worked.

    What Is EasyR1?

    DeepSeek released R1 in early 2025 as a proof point for large-scale reasoning-oriented RL: a model that drew attention for stronger benchmark performance and longer, more structured reasoning traces during inference. The important nuance is that GRPO did not start with R1. DeepSeek introduced Group Relative Policy Optimization (GRPO) earlier in the DeepSeekMath work, then scaled RL much more aggressively in the R1 recipe.

    Open-source teams then made the workflow easier to reproduce. hiyouga — the developer who also built LlamaFactory — released EasyR1 with a strong focus on scalable rollout training and vision-language model support. Built on the veRL training stack, EasyR1 is geared toward distributed rollouts across multiple GPUs, memory-efficient execution, and support for models too large to fit on a single GPU. EasyR1 also implements DAPO, a newer RL recipe aimed at tasks with hard correctness signals such as math or coding.

    The key insight: RL fine-tuning is not just a research curiosity anymore. In FlexAI's published EasyR1 blueprint example, a 7B run uses a single node with 8×H100 GPUs and is framed as a practical reference setup rather than a custom research stack. But most teams still have not run this workflow themselves, because managing checkpoints, distributed training, and job orchestration has historically required significant DevOps investment.

    Why reinforcement learning instead of supervised fine-tuning?

    Supervised fine-tuning (SFT) works by showing the model pairs of instructions and desired outputs, then optimizing it to match those outputs. It's fast, less compute-intensive than RL, and effective for instruction following, knowledge distillation, and style transfer. If your goal is to make a model follow your custom format or adopt a specific tone, SFT is the right tool.

    But SFT has a hard boundary: it can only teach the model to imitate the data it sees. If the task requires reasoning that isn't explicit in the training data—multi-step math, code debugging, logical deduction—SFT stalls. The model learns correlations but not the underlying problem-solving process.

    Reinforcement learning inverts the problem. Instead of showing the model correct answers, you show it a scoring function (a reward model) that evaluates whether an answer is correct, then train the model to maximize that score. The model learns to generate its own reasoning to reach higher rewards.

    In practice, this means:

    • For instruction following, domain adaptation, or style transfer: SFT is faster and cheaper. Use QLoRA, full fine-tuning, or continued pre-training.

    • For reasoning, multi-step math, or code generation: SFT will plateau. RL reaches higher accuracy by learning the problem-solving structure.

    • For tasks where "correct" is objective (math, multiple-choice, exact match): RL converges reliably. For subjective tasks (creative writing, open-ended advice), you need a well-tuned reward model.

    If your baseline SFT model has clearly plateaued and you still need materially better reasoning accuracy, RL may be the next technique to test. If your model is already performing well and you only need marginal gains, the effort-to-return ratio often doesn't justify it.

    EasyR1 GRPO vs DAPO: Which Should You Use?

    EasyR1 implements two core algorithms: GRPO and DAPO. They solve the same problem (learning from rewards instead of labeled outputs) but with different tradeoffs.

    GRPO (Group Relative Policy Optimization) is the RL algorithm DeepSeek introduced in DeepSeekMath and later used at larger scale in the R1 line. For each prompt, you generate multiple candidate responses and score them. GRPO then compares answers within the group, computing an advantage for each based on relative quality. It's well-established, doesn't require a separate critic or value function, and works across reasoning domains.
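To make "group relative" concrete, here is a minimal sketch of the advantage computation, assuming the common mean/std normalization within a group of candidates; the function name is illustrative, not EasyR1's internal API.

```python
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: each response is scored relative to its
    siblings sampled from the same prompt, so no critic network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 5 candidate answers to one prompt, scored by a reward function.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.5]))
# Above-average answers get positive advantages and are reinforced;
# below-average answers get negative advantages and are pushed down.
```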

    GRPO shines when:

    • Your reward signal is noisy or subjective (e.g., a learned model that ranks code quality)

    • You're fine-tuning a general-purpose model across many task types

    • You want stable training without collapsing to bad local optima

    • You have compute for generating and scoring multiple candidates per prompt (this is the main cost)

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) is a newer RL recipe designed for large-scale reasoning training. In practice, its appeal is that it adds mechanisms for long outputs, response filtering, and sampling behavior that are meant to make RL on verifiable tasks more reliable. That makes it especially relevant when you care about long chain-of-thought-style completions or tightly scored tasks such as math and code.

    DAPO is often a stronger candidate when:

    • Your task has objective, verifiable correctness (multiple-choice, math, code with unit tests)

    • Long, step-by-step answers are valuable (the model should think through hard problems)

    • You're tuning for a narrow use case, not broad generalization

    • You can define a crisp reward model (binary pass/fail, exact match, test suite)

Quick decision table (starting-point heuristics, not benchmark-proven absolutes):

    • Mixed reasoning + fluency tasks → GRPO. More established default for general-purpose RL fine-tuning.

    • Math or logic problems with exact answers → DAPO. Better fit when correctness is objective and long reasoning traces matter.

    • Code generation with unit tests → DAPO. Verifiable reward signal makes the setup cleaner.

    • Unsure and want the lowest-risk starting point → GRPO. Easier baseline before testing narrower optimizations.

    Decision framework:

    • Tuning a reasoning model for math or logic puzzles? Start with DAPO.

    • Tuning for code generation with automated tests? Start with DAPO.

    • Tuning across mixed reasoning + fluency tasks? Start with GRPO.

    • Unsure? Start with GRPO as the more established default, then test DAPO if long reasoning traces and verifiable reward signals are central to your use case.

    Both algorithms are available in EasyR1. FlexAI's blueprint covers configuration for both. The training loop is identical; you only change the algorithm field in your YAML config.
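As an illustration, the switch can be a one-line config change. The fragment below uses veRL-style keys as an assumption; check the blueprint's YAML template for the authoritative names.

```yaml
# Illustrative fragment, not the full blueprint config.
algorithm:
  adv_estimator: grpo   # swap this value to select the DAPO recipe instead
```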

    EasyR1 also implements REINFORCE, the simplest policy gradient baseline. Unlike GRPO and DAPO, REINFORCE doesn't compare responses within a group—it updates directly on individual response outcomes. This makes it higher-variance and less stable in practice. Use it only as a debugging baseline to verify that your reward function is working, not as a production training method.

    What you need to run EasyR1 on FlexAI

    To run the FlexAI EasyR1 fine-tuning blueprint, you'll need:

Hardware: Our blueprint recommends a single node of 8×H100 GPUs, sufficient for fine-tuning a 7B model (Qwen2.5-7B is used in the blueprint) with group size 5 (5 candidate responses per prompt). If you're fine-tuning a larger model (13B or 70B), you'll need additional nodes or a smaller group size to fit in memory.

    FlexAI's infrastructure is optimized for this: sub-60-second job launch, and managed distributed training with vLLM and FSDP (fully sharded data parallel). You do not manage network configurations — FlexAI handles infrastructure setup. Checkpoints are saved automatically so you don't lose progress if a run fails, but restarting a failed job is a manual step.

    Input data format: EasyR1 expects a dataset of prompts with a scoring function. You provide:

    1. A training dataset: list of prompts (e.g., math problems, coding tasks)

    2. A reward model or objective function that scores responses (e.g., "is the answer correct?")

    The framework generates multiple responses per prompt, scores them, and updates the model. You do not need labeled outputs—only prompts and a way to evaluate responses. For math, this is a script that checks correctness. For code, it's your test suite.
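For a math task, that scoring script can be as small as the sketch below. The function name and signature are illustrative, not EasyR1's required reward interface; adapt them to the hook the blueprint defines.

```python
import re

def math_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the last number in the response matches
    the ground-truth answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0

assert math_reward("The answer is 42.", "42") == 1.0
assert math_reward("I think it's 41.", "42") == 0.0
```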

Software setup: Start from the blueprint at docs.flex.ai/blueprints/easyR1, which contains the YAML configuration template and step-by-step instructions. The blueprint gives you a working runtime, requirements path, and launch command, so you do not have to assemble the training environment from scratch.

    Reward model: This is the critical part. Your reward model doesn't need to be perfect, but it does need to distinguish between good and bad responses reliably. Options:

    • Objective scoring: For math, check if the final answer matches. For code, run unit tests. For classification, check if the predicted label is correct.

    • Learned model: Train a separate classifier that predicts whether a response is good. This is slower but works for subjective tasks.

    • LLM as judge: Use a stronger frontier model to score responses. This is expensive per iteration but often the most practical option for tasks without an objective check.

    For most teams, start with objective scoring. It's cheap, fast, and often sufficient.

    Reward function design patterns worth knowing:

    • Partial credit: Binary pass/fail rewards can slow convergence on hard tasks. Consider awarding partial credit for correct reasoning steps even when the final answer is wrong (e.g., 0.5 for showing valid work, 1.0 for a correct answer). This stabilizes early training and gives the model more signal to learn from. A sketch combining all three patterns follows this list.

    • Reward hacking safeguards: Models will exploit weaknesses in your reward function. Common failure modes: generating very short outputs that pass a length check, copying the prompt verbatim to satisfy a format regex, or producing answers that match a pattern but are semantically wrong. Add negative rewards for empty reasoning chains, repeated tokens, or format violations. Manually sample scored outputs every 100 steps to catch drift early.

    • Format rewards: If you want the model to reason in a specific structure (e.g., <think>...</think><answer>...</answer>), add a small format compliance reward alongside your correctness signal. Without it, the model often drifts to unstructured output even when answers are correct.
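Putting all three patterns together, a shaped reward might look like the following sketch. The tags, weights, and thresholds are illustrative choices, not values from the blueprint.

```python
import re

TEMPLATE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def shaped_reward(response: str, ground_truth: str) -> float:
    match = TEMPLATE.search(response)
    if match is None:
        return -0.1  # format violation: small negative reward
    thinking, answer = match.group(1).strip(), match.group(2).strip()
    if len(thinking) < 20:
        return -0.1  # guard against empty reasoning chains (reward hacking)
    score = 0.1      # small format-compliance reward
    if answer == ground_truth:
        score += 0.9  # full credit for a correct final answer
    elif ground_truth in thinking:
        score += 0.4  # partial credit: the right value appears in the work
    return score
```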

    How long does it take and what does it cost?

Blueprint example: The FlexAI EasyR1 blueprint fine-tunes Qwen2.5-7B-Instruct on math12k (12,000 math problems) using a reward function that scores both format compliance and answer correctness. After GRPO training, the model visibly shifts toward step-by-step reasoning instead of bare answers. Full YAML config, reward function code, and a before/after comparison are in the blueprint.

    You don't need to build the stack — you need a reward function and a clear experiment plan. In the FlexAI blueprint example, using EasyR1 to fine-tune a 7B model on a single 8×H100 node is presented as a manageable reference setup rather than a large infrastructure project. At the current public rate of $2.10/hr per H100 on FlexAI Cloud Services, an 8×H100 node costs $16.80/hour. In an illustrative 10-hour scenario, that run would cost $168. This includes:

    • Managed infrastructure with built-in observability

    • Distributed training orchestration (vLLM, FSDP, checkpoint management)

    • Automatic checkpoint saves (no lost runs from transient failures)

    • Sub-60-second job launch (start training minutes after you submit)

    Cost comparison: If you run 3 training experiments (tweaking hyperparameters, trying different datasets, or testing GRPO vs DAPO), that's an illustrative ~$504 at the example hourly rate above. FlexAI charges for the compute time you reserve, not per generated token, so the hourly rate itself stays predictable. But more rollouts, longer completions, or heavier evaluation still extend wall-clock runtime, which raises the final bill. When you're ready to run, it's one command: flexai training run.

From checkpoint to serving: Each training job produces checkpoints every N steps. Once training completes, you can move quickly from checkpoint selection to inference deployment using FlexAI's checkpoint and managed inference workflow.

    How to evaluate whether it worked

    After training, you need to answer: Did this actually improve the model? Supervised fine-tuning is easy to benchmark (just compare accuracy on a holdout test set). RL is trickier because you're optimizing for a process (step-by-step thinking), not just final accuracy.

    Evaluate on three dimensions:

    1. Accuracy on the target task. Generate responses from the trained model on your test set and score them using your reward function. For standardized benchmark comparisons, tools like lm-evaluation-harness work well. For your own task-specific holdout set, use the same reward script you trained with so pre-RL and post-RL comparisons stay consistent (a minimal sketch of this comparison follows this list). Compare to the baseline (pre-training or SFT). If accuracy went up, RL worked. If it stayed flat or dropped, check your reward model; it might be misaligned.

    2. Reasoning quality and length. RL models tend to generate longer, more structured outputs. Use a framework like Prometheus or a simple LLM-as-judge script to score reasoning quality, or manually sample 20–50 outputs and read them.

    3. Generalization to held-out domains. Train on one type of math problem, test on a different type. Does the model transfer learning? Do not assume it will. If performance drops sharply on new domains, you likely overfit the reward model or tuned too tightly to your training set.
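As a sketch of the comparison in point 1, the loop can stay this simple, provided both checkpoints are scored by the exact same function. `generate` stands in for whatever inference call you use, and `math_reward` is the hypothetical scorer sketched earlier.

```python
def accuracy(generate, prompts, answers, reward_fn) -> float:
    """Fraction of holdout prompts the model answers correctly."""
    scores = [reward_fn(generate(p), a) for p, a in zip(prompts, answers)]
    return sum(1 for s in scores if s >= 1.0) / len(scores)

# Same prompts, same reward_fn for both checkpoints, so the delta is meaningful:
# delta = (accuracy(post_rl_generate, prompts, answers, math_reward)
#          - accuracy(pre_rl_generate, prompts, answers, math_reward))
```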

    Practical baseline: Start with your best SFT model (or a public model like Qwen2.5-7B). Run RL for 1–2 epochs on your dataset. Measure accuracy before and after. If you see a clear improvement on your holdout set, continue iterating. If flat or worse, revisit your reward model or expand your training data.

    Before you ship or even share the checkpoint widely, compare pre- and post-RL outputs on 20–50 representative inputs in a lightweight eval workflow. Does the reasoning look more structured? Are answers more complete? That's often a faster and more honest signal than any single benchmark score.

    Debugging common training failures

    EasyR1's distributed setup (vLLM + FSDP) has more failure modes than a standard SFT job. Most issues fall into three buckets.

    Infrastructure failures:

    • Container or runtime version mismatches: If EasyR1 fails to initialize vLLM rollout workers or the training actors, it's usually a version conflict. The FlexAI blueprint runtime pins the correct versions automatically — if you're customizing the setup, match your dependencies to the versions in EasyR1's requirements file.

    • NCCL timeouts during multi-node training: Typically caused by network misconfiguration. Verify all nodes can communicate on the NCCL port (default 29400). FlexAI handles this automatically on managed clusters.

    • GPU assignment mismatches: EasyR1 assigns GPUs to actor (training) and rollout (response generation) roles. If your node's GPU count doesn't divide evenly across roles, training fails at startup. Always match your node config to the blueprint spec.

    Training instability:

    • Reward stuck at zero: The reward function is never satisfied. Manually run it against 10–20 sample outputs before launching a full training job (see the sanity-check sketch after this list). If it returns zero for everything, the problem is the reward function, not the model.

    • KL divergence exploding: the model is drifting too far, too fast from its original base weights. Reduce the learning rate by 2x or increase the KL penalty coefficient. DAPO is less susceptible to this than GRPO due to its token-level loss weighting.

    • "Training finished but errored": A checkpoint was written before the crash. Resume from it—EasyR1 supports this natively with a config flag. Your run is not lost.

    Evaluation confusion:

    • Compare pre-RL and post-RL models using the same reward function. Switching evaluation scripts between runs makes comparisons meaningless.

    • Use temperature=0 for deterministic eval. RL-trained models are sensitive to temperature; high-temperature inference will look worse than the trained distribution suggests.
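With Hugging Face transformers, greedy decoding (the temperature-0 equivalent) is one flag. This is a minimal sketch using the blueprint's base model as an example, without the device and dtype handling a real eval script would add.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Solve: 17 * 23 = ?", return_tensors="pt")
# do_sample=False disables sampling entirely, so every eval run produces
# the same output and checkpoints stay comparable.
out = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```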

    When not to use EasyR1

    RL is not always the right choice.

    Domain adaptation on new corpora: If your task is to adapt a general model to your domain (legal docs, medical records, internal code), supervised fine-tuning is faster and cheaper. You have labeled examples; use them. RL makes the most sense when you have a reliable scoring function — ideally an objective one, like exact-match grading or unit tests — that can evaluate whether a response is correct.

    Few-shot or zero-shot scenarios: If you have <100 training examples, SFT or prompt engineering will get you farther. RL needs enough data to find meaningful signal. For small datasets, reward models are unstable.

Fast iteration under deadline pressure: If you need results in days, not weeks, SFT is faster to experiment with. RL requires defining a reward model, which takes time to get right.

    Subjective or creative tasks: If the task is open-ended (creative writing, open-ended advice), a reward model is hard to align. RL can work here, but you need a very strong judge (often a human) scoring samples, which is expensive.

    Tasks where SFT already works well: If your SFT model is already at 90%+ accuracy, the effort to rig up RL training probably isn't worth the marginal gain.

Use EasyR1 when you have an objective correctness signal, modest compute, and a clear accuracy plateau that SFT cannot pass. For everything else, optimize for speed.


    Ready to push your model's reasoning further?

Get started with the EasyR1 blueprint, or talk to us if you want guidance on data prep, reward model design, or scaling to multi-node training.

    Get Started Today

    Start building with €100 in free credits for first-time users.