- Avishek Biswas (@neural_avb) shipped a 50-minute long-form tutorial that walks through GRPO's low-level mechanics, trains sub-1B SmolLM and Qwen3 models on text-based RLVR gym envs, and animates PPO updates so you literally see the policy logits shift.
TL;DR
Avishek Biswas — known on X as @neural_avb and on YouTube as Neural Breakdown with AVB — released a ~50-minute tutorial that compresses everything an ML engineer needs to actually train a reasoning model with GRPO on RLVR environments. The unique angle: he runs the algorithm on sub-1B-parameter models (SmolLM-135M / 360M, Qwen3-0.6B), uses text-based gym envs from reasoning-gym, and includes an animated PPO walkthrough where you watch logits update with each policy step. Code is open source.
What's new
Most public GRPO content lives in one of two places: the math papers (DeepSeekMath, the verl docs) or 7B+ scale demos that need a multi-GPU node. AVB's drop sits in the missing middle — and adds three things you don't usually get together:
- A visual, animated tour of GRPO: instead of just printing the loss formula, the video shows the group-of-responses sampling, the within-group reward normalization, and the resulting policy gradient as a moving picture.
- Text-based gym envs: deductive Syllogism and Propositional Logic tasks pulled from the reasoning-gym library, with verifiable, rule-based rewards instead of a learned reward model.
- PPO math at the logit level: a deep dive into the surrogate objective where the camera literally hovers over the model's output logits and you see them shift after each policy update, the kind of thing every PPO blog post hand-waves past.
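To make the within-group normalization concrete, here is a minimal PyTorch sketch of the group-relative advantage computation (names are illustrative, not AVB's code):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Z-score rewards within one group of G responses to the same prompt.

    rewards: shape (G,), one scalar verifier reward per sampled response.
    Returns advantages of shape (G,): responses better than the group
    mean get positive advantage, worse ones negative.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G = 6 responses, binary pass/fail rewards from a rule-based checker.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
print(group_advantages(rewards))  # passes get ~+0.91, fails ~-0.91
```

This group baseline is the whole reason GRPO needs no critic: the other G − 1 samples stand in for a learned value estimate.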
Why it matters
GRPO went mainstream after DeepSeek-R1, but the on-ramp for everyone who isn't at a frontier lab is still bad. TRL gives you a one-liner trainer; the papers give you the full Bellman-style derivation; almost nothing in between explains what the algorithm is doing on a small model you can actually fit on one GPU. AVB's contribution is the bridge — and at 50 minutes, it's long enough to teach the math without devolving into a code-along.
Technical facts
Drawing from AVB's Towards Data Science companion piece, here is the recipe the video walks through:
| Component | Choice |
|---|---|
| Models | SmolLM-135M-Instruct, SmolLM-360M-Instruct, Qwen3-0.6B |
| Algorithm | GRPO (no critic network) |
| Reward | RLVR — rule-based verifier, not a learned RM |
| Tasks | Syllogism, Propositional Logic (from reasoning-gym) |
| Format | Chain-of-thought wrapped in <think> / <answer> tags |
| Learning rate | 1e-6 |
| Sampling temp | 0.7 |
| Group size G | 6 responses per prompt |
| Replay buffer | 500 |
| Grad accumulation | 12 steps |
| Max new tokens | 300 |
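For orientation, the recipe above maps one-to-one onto a small config object. A hypothetical sketch (field names are mine, values taken from the table; the SmolLM repo id is the standard Hugging Face one, swap in 360M or Qwen3-0.6B as needed):

```python
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    # Values from AVB's recipe; field names are illustrative, not his code.
    model_name: str = "HuggingFaceTB/SmolLM-135M-Instruct"
    learning_rate: float = 1e-6
    temperature: float = 0.7        # sampling temp for the G rollouts
    group_size: int = 6             # G responses per prompt
    replay_buffer_size: int = 500
    grad_accum_steps: int = 12
    max_new_tokens: int = 300       # cap on generated CoT + answer
```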
Reported gains: SmolLM-135M starts around 60% accuracy on Syllogism after SFT, and RL adds roughly +20% absolute across all three model sizes. SmolLM-360M lands at 81% on Propositional Logic, the harder of the two envs.
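The "Reward" and "Format" rows above are the whole RLVR trick: the reward is string parsing plus an exact-match check, no reward model anywhere. A minimal sketch of such a verifier (illustrative, not AVB's exact implementation; the partial format credit is a common GRPO trick, not confirmed from the source):

```python
import re

def verifier_reward(completion: str, ground_truth: str) -> float:
    """Rule-based RLVR reward: extract the <answer> tag and exact-match it.

    Returns 1.0 for a correct, well-formatted answer; a small partial
    credit for correct format only; 0.0 otherwise.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no parseable answer -> no reward
    answer = match.group(1).strip().lower()
    if answer == ground_truth.strip().lower():
        return 1.0
    return 0.1  # format bonus: parseable but wrong (assumption, see lead-in)

# Example on a syllogism-style item:
out = "<think>All A are B; X is an A, so X is a B.</think><answer>Yes</answer>"
print(verifier_reward(out, "Yes"))  # 1.0
```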
Comparison: PPO vs GRPO at the small-model scale
| Dimension | PPO | GRPO |
|---|---|---|
| Critic network | Required, ~policy size | None — group baseline replaces it |
| VRAM cost | ~2× model weights (policy + critic) | Roughly half of PPO's; sub-1B training fits on a single GPU |
| Advantage signal | GAE from learned V(s) | Z-score within a group of G responses |
| KL term | Reward shaping (per token) | Subtracted from surrogate as a separate penalty |
| Best fit | RLHF with preference RMs | RLVR with verifiable checkers |
The GRPO advantage in one line: A_i = (R(r_i) − mean(group)) / std(group). The full objective is the PPO clipped surrogate minus a weighted KL penalty against the reference policy: L_GRPO = L_clip − w1 · D_KL(π_θ ‖ π_orig). AVB walks through both terms on screen.
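Putting the two terms together, here is a minimal PyTorch sketch of that loss at the token level (tensor shapes, the clip_eps and kl_weight defaults, and the use of DeepSeekMath's k3 KL estimator are assumptions; padding masks are omitted for brevity):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              clip_eps=0.2, kl_weight=0.04):
    """Clipped surrogate minus KL penalty, per token.

    logp_new / logp_old / logp_ref: (G, T) log-probs of the sampled tokens
    under the current, rollout-time, and frozen reference policies.
    advantages: (G,) group-normalized rewards, broadcast to every token.
    """
    ratio = torch.exp(logp_new - logp_old)            # pi_theta / pi_old
    adv = advantages.unsqueeze(-1)                    # (G, 1), broadcasts over T
    surrogate = torch.min(
        ratio * adv,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
    )
    # k3 estimator of D_KL(pi_theta || pi_ref), as in the DeepSeekMath paper.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
    # Maximize (surrogate - w1 * KL), so minimize its negative.
    return -(surrogate - kl_weight * kl).mean()
```

Note how the KL term lives in the loss itself, not in the per-token reward as in classic PPO-RLHF, which is exactly the table's "KL term" row.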
Use cases
- Indie reasoning RL: the SmolLM-135M + GRPO recipe is the cheapest end-to-end reasoning RL you can run today — single consumer GPU, hours not days.
- Domain-specific verifiable tasks: anything where you can write a checker (regex matchers, math equality, unit tests, JSON schema validators) slots into the same RLVR loop with no reward-model training; see the sketch after this list.
- Teaching PPO: the animated logit-update segment is the rare resource that closes the gap between reading the PPO paper and actually understanding what the optimizer is doing.
- Research baselines: a from-scratch GRPO impl (not a TRL wrapper) is a clean starting point for ablations on group size, KL weight, or curriculum design.
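As referenced in the checker bullet above, two hypothetical checkers show how other verifiable domains plug into the same reward interface as the logic tasks (my examples, not from AVB's repo):

```python
import json
import re

def regex_reward(completion: str, pattern: str) -> float:
    """1.0 if the model's answer matches a required pattern (e.g. a date)."""
    return 1.0 if re.fullmatch(pattern, completion.strip()) else 0.0

def json_reward(completion: str, required_keys: set[str]) -> float:
    """1.0 if the output parses as JSON and contains every required key."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if required_keys <= set(obj) else 0.0

print(regex_reward("2024-05-01", r"\d{4}-\d{2}-\d{2}"))         # 1.0
print(json_reward('{"name": "a", "age": 3}', {"name", "age"}))  # 1.0
```

Any function with this completion-in, scalar-reward-out shape can replace the syllogism verifier without touching the training loop.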
Limitations & pricing
- Free — YouTube video + the TDS article + open code on GitHub.
- Tasks are constrained logic puzzles, not long-form chat, coding, or open-domain reasoning.
- Tiny still means a GPU. SmolLM-135M trains on a single consumer card; CPU-only is not in scope.
- The from-scratch impl is for understanding, not throughput. For real runs, verl, TRL, and Unsloth are faster.
- Gains are reported against the SFT baseline of the same small model, not against frontier reasoners.
What's next
The obvious follow-ups: multi-task RLVR curricula, scaling the same recipe to 1B–3B models, and porting to production-grade trainers like verl or Unsloth's GRPO path. If you have been waiting for a GRPO tutorial that goes deep on the math and ships working code at a size you can run, this is it.
Sources: @neural_avb on X, Towards Data Science, Yuge Shi's PPO & GRPO guide, DeepSeekMath.



