TL;DR

Avishek Biswas — known on X as @neural_avb and on YouTube as Neural Breakdown with AVB — released a ~50-minute tutorial that compresses everything an ML engineer needs to actually train a reasoning model with GRPO on RLVR environments. The unique angle: he runs the algorithm on sub-1B-parameter models (SmolLM-135M / 360M, Qwen3-0.6B), uses text-based gym envs from reasoning-gym, and includes an animated PPO walkthrough where you watch logits update with each policy step. Code is open source.

What's new

Most public GRPO content lives in one of two places: the math papers (DeepSeekMath, the verl docs) or 7B+ scale demos that need a multi-GPU node. AVB's drop sits in the missing middle — and adds three things you don't usually get together:

  • A visual, animated tour of GRPO: instead of just printing the loss formula, the video shows the group-of-responses sampling, the within-group reward normalization, and the resulting policy gradient as a moving picture.
  • Text-based gym envs: deductive Syllogism and Propositional Logic tasks pulled from the reasoning-gym library, with verifiable, rule-based rewards instead of a learned reward model.
  • PPO math at the logit level: a deep dive into the surrogate objective where the camera literally hovers over the model's output logits and you see them shift after each policy update — the kind of thing every PPO blog post hand-waves past.
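The group-then-normalize step is the heart of GRPO and fits in a few lines. A minimal numpy sketch (function and variable names are illustrative, not from AVB's repo):

```python
import numpy as np

def group_advantages(rewards):
    """Within-group reward normalization (z-score), which GRPO uses
    in place of a learned critic. `rewards` holds one scalar per
    sampled response in the group."""
    r = np.asarray(rewards, dtype=float)
    # Epsilon guards against a zero-variance group (all responses scored equally).
    return (r - r.mean()) / (r.std() + 1e-8)

# A group of G=6 responses to one prompt, scored 1/0 by a rule-based verifier.
advs = group_advantages([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
# Correct responses get positive advantage, incorrect get negative;
# the group mean acts as the baseline a PPO critic would otherwise provide.
```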

Why it matters

GRPO went mainstream after DeepSeek-R1, but the on-ramp for everyone who isn't at a frontier lab is still bad. TRL gives you a one-liner trainer; the papers give you the full Bellman-style derivation; almost nothing in between explains what the algorithm is doing on a small model you can actually fit on one GPU. AVB's contribution is the bridge — and at 50 minutes, it's long enough to teach the math without devolving into a code-along.

Technical facts

Drawing from AVB's Towards Data Science companion piece, here is the recipe the video walks through:

  • Models: SmolLM-135M-Instruct, SmolLM-360M-Instruct, Qwen3-0.6B
  • Algorithm: GRPO (no critic network)
  • Reward: RLVR — rule-based verifier, not a learned RM
  • Tasks: Syllogism, Propositional Logic (from reasoning-gym)
  • Format: chain-of-thought wrapped in <think> / <answer> tags
  • Learning rate: 1e-6
  • Sampling temperature: 0.7
  • Group size: G = 6 responses per prompt
  • Replay buffer: 500
  • Gradient accumulation: 12 steps
  • Max new tokens: 300
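The hyperparameters above collect naturally into a small config object. A sketch with illustrative field names (not the repo's actual variable names):

```python
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    # Values taken from the recipe above; names are illustrative.
    learning_rate: float = 1e-6
    temperature: float = 0.7       # sampling temperature for group rollouts
    group_size: int = 6            # G responses sampled per prompt
    replay_buffer_size: int = 500
    grad_accum_steps: int = 12
    max_new_tokens: int = 300

cfg = GRPOConfig()
```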

Reported gains: SmolLM-135M starts at roughly 60% accuracy on Syllogism after SFT, and RL adds roughly 20 points of absolute accuracy across all three model sizes. SmolLM-360M reaches 81% on Propositional Logic — the harder of the two envs.

Comparison: PPO vs GRPO at the small-model scale

  • Critic network: PPO requires one, roughly policy-sized; GRPO replaces it with the group baseline.
  • VRAM cost: PPO holds ~2× the model in memory; GRPO needs roughly half, which fits sub-1B training on a single GPU.
  • Advantage signal: PPO uses GAE from a learned V(s); GRPO uses a z-score within a group of G responses.
  • KL term: PPO folds it into the reward as per-token shaping; GRPO subtracts it from the surrogate as a separate penalty.
  • Best fit: PPO for RLHF with preference RMs; GRPO for RLVR with verifiable checkers.

The GRPO advantage in one line: A_i = (R(r_i) − mean(group)) / std(group). The full objective is the PPO clipped surrogate plus a KL penalty against the reference policy: L_GRPO = L_clip − w1 · D_KL(π_θ ‖ π_orig). AVB walks through both terms on screen.
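At the single-token level, both terms fit in a short function. The sketch below assumes the k3 KL estimator (exp(x) − x − 1) common in GRPO implementations and an illustrative KL weight; neither value is confirmed from the video:

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, advantage,
              clip_eps=0.2, kl_weight=0.04):
    """One-token GRPO objective: PPO clipped surrogate minus a KL
    penalty against the reference policy. clip_eps and kl_weight (w1)
    are illustrative defaults, not AVB's exact settings."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic min over unclipped vs clipped surrogate, as in PPO.
    surrogate = np.minimum(ratio * advantage, clipped * advantage)
    # k3 KL estimator: exp(x) - x - 1 with x = log(pi_ref / pi_theta);
    # always non-negative, low-variance.
    log_ratio = logp_ref - logp_new
    kl = np.exp(log_ratio) - log_ratio - 1.0
    # Return a loss to minimize (negative of the objective to maximize).
    return -(surrogate - kl_weight * kl)
```

With logp_new == logp_old == logp_ref, the ratio is 1 and the KL term vanishes, so the loss reduces to minus the advantage.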

Use cases

  • Indie reasoning RL: the SmolLM-135M + GRPO recipe is the cheapest end-to-end reasoning RL you can run today — a single consumer GPU, hours rather than days.
  • Domain-specific verifiable tasks: anything where you can write a checker — regex matchers, math equality, unit tests, JSON schema validators — slots into the same RLVR loop with no reward-model training.
  • Teaching PPO: the animated logit-update segment is the rare resource that closes the gap between reading the PPO paper and actually understanding what the optimizer is doing.
  • Research baselines: a from-scratch GRPO impl (not a TRL wrapper) is a clean starting point for ablations on group size, KL weight, or curriculum design.
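A minimal example of such a checker, assuming the <think> / <answer> output format described above (an illustration, not reasoning-gym's actual API):

```python
import re

def verifiable_reward(completion, target):
    """RLVR-style rule-based reward: pull the model's final answer
    out of <answer> tags and compare it to the known-correct target.
    Returns a binary reward; no learned reward model involved."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0  # malformed output earns no reward
    answered = match.group(1).strip().lower()
    return 1.0 if answered == target.strip().lower() else 0.0

reward = verifiable_reward(
    "<think>All A are B, all B are C...</think><answer>Valid</answer>",
    "valid",
)
```

Swapping in math equality, a unit-test runner, or a JSON schema validator changes only the body of the checker; the RLVR loop around it stays the same.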

Limitations & pricing

  • Free — YouTube video + the TDS article + open code on GitHub.
  • Tasks are constrained logic puzzles, not long-form chat, coding, or open-domain reasoning.
  • Tiny still means a GPU. SmolLM-135M trains on a single consumer card; CPU-only is not in scope.
  • The from-scratch impl is for understanding, not throughput. For real runs, verl, TRL, and Unsloth are faster.
  • Gains are reported against the SFT baseline of the same small model, not against frontier reasoners.

What's next

The obvious follow-ups: multi-task RLVR curricula, scaling the same recipe to 1B–3B models, and porting to production-grade trainers like verl or Unsloth's GRPO path. If you have been waiting for a GRPO tutorial that goes deep on the math and ships working code at a size you can run, this is it.

Sources: @neural_avb on X, Towards Data Science, Yuge Shi's PPO & GRPO guide, DeepSeekMath.