TL;DR
A CNN doesn't see a picture. It sees a 3D tensor of numbers, slides small learnable filters across it, and stacks the results into feature maps that go from edges → textures → parts → objects. Eight primitives — tensors, filters, feature maps, stride, padding, channels, pooling, receptive fields — explain every modern vision architecture. Get those right and ResNet, U-Net, YOLO, and the Stable Diffusion encoder all stop being magic.
What sparked this
A viral X post by @tetsuoai distilled the entire CNN stack into 16 boxes. The framing is simple: forget architectures for a second; learn the eight primitives, and the architectures fall out for free. This article is the long-form version with the actual numbers.
Why this mental model matters
Most people who try to learn CNNs get stuck in two places: shape mismatches in PyTorch ("why is my tensor (1, 64, 28, 28)?") and hyperparameter cargo-culting ("why does everyone use 3×3 stride 1 padding 1?"). Both collapse to the same root cause — the engineer never internalized what each primitive does to the tensor. Once you can mentally trace a 224×224×3 input through every layer and predict the output shape, you can read any vision paper in a single pass.
The eight primitives, with numbers
1. Tensors
A single RGB image is a tensor of shape (H, W, 3). A mini-batch is 4D: (N, H, W, C) in channels-last frameworks, or (N, C, H, W) in PyTorch. CIFAR-10 images are 32×32×3; ImageNet crops are typically 224×224×3. The network never "sees pixels"; it sees this tensor.
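A minimal sketch of these shapes in PyTorch, which uses the channels-first (N, C, H, W) layout; the batch size of 16 is just an illustrative choice:

```python
import torch

# One CIFAR-10 image: 3 channels, 32x32 pixels (channels-first in PyTorch)
image = torch.randn(3, 32, 32)

# A mini-batch of 16 ImageNet-sized crops: (N, C, H, W)
batch = torch.randn(16, 3, 224, 224)

print(image.shape)  # torch.Size([3, 32, 32])
print(batch.shape)  # torch.Size([16, 3, 224, 224])
```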
2. Filters (kernels)
A filter is a small learnable tensor — usually 3×3×C. The ×C is critical: filters are full-depth along channels but local along width and height. A 3×3 filter on RGB is really a 3×3×3 = 27-weight slab.
3. Feature maps
Slide the filter, dot-product at every position, write a scalar. The output is a 2D map showing where that pattern fires. Apply K filters, get K stacked feature maps — that's your next tensor's depth.
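A sketch of primitives 2 and 3 together using PyTorch's nn.Conv2d; the 64-filter count is an illustrative choice, not a rule:

```python
import torch
import torch.nn as nn

# 64 learnable filters, each 3x3 across space and full-depth (3) across channels
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

# Weight tensor layout: (num_filters, in_channels, kH, kW) -> 64 x 3 x 3 x 3
print(conv.weight.shape)  # torch.Size([64, 3, 3, 3])

x = torch.randn(1, 3, 32, 32)  # one RGB image
y = conv(x)                    # 64 stacked feature maps
print(y.shape)                 # torch.Size([1, 64, 30, 30]) - no padding, so 32 -> 30
```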
4. Stride
Stride S is how many pixels the filter jumps each step. S=1 is a dense scan. S=2 halves the output along each spatial axis, which cuts output positions (and compute) by roughly 4×. Strided convs are the cheap downsample of choice in modern nets that drop pooling.
5. Padding
Zeros added around the border so the filter can sit on edge pixels. With P = (F−1)/2 and stride 1, output size equals input size. This is exactly why "3×3, stride 1, padding 1" is the most common conv setting on Earth — it preserves dimensions, so you can stack 50 of them without size-tracking headaches.
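A quick sketch of how stride and padding combine, again with nn.Conv2d and illustrative sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# "3x3, stride 1, padding 1": output spatial size equals input size (32 -> 32)
same = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
print(same(x).shape)  # torch.Size([1, 64, 32, 32])

# Stride 2 as a cheap downsample: each spatial dimension is halved (32 -> 16)
down = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
print(down(x).shape)  # torch.Size([1, 64, 16, 16])
```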
6. Channels
Channels are feature detectors. RGB has 3 input channels. After a conv with 64 filters, you have 64 output channels — each one a different learned pattern. Don't confuse "channel depth" (third tensor axis) with "network depth" (number of layers); they share the word but mean different things.
7. Pooling
The standard is 2×2, stride 2, max. It throws away 75% of activations but keeps the strongest signal in each 2×2 window. Zero learnable parameters. Pooling is destructive — modern segmentation nets often skip it for strided conv instead.
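A sketch of the standard 2×2, stride-2 max pool in PyTorch:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # zero learnable parameters

x = torch.randn(1, 64, 28, 28)
y = pool(x)
print(y.shape)  # torch.Size([1, 64, 14, 14]) - 3 of every 4 activations discarded
```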
8. Receptive fields
How much of the original image one neuron "sees". Stack three 3×3 conv layers and a single neuron in layer 3 has an effective 7×7 view of the input. Three 3×3s cost 3×9 = 27 parameters; one 7×7 costs 49. That's 45% fewer params plus two extra non-linearities — the entire reason VGG-style 3×3 stacks beat older designs.
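A rough sketch checking both claims. The 27-vs-49 count in the text is per input/output-channel pair; with C channels on each side both totals scale by C², so the ratio is unchanged. The channel count below is an arbitrary illustration:

```python
# Receptive-field growth for a stack of stride-1 convs:
# each 3x3 layer adds (F - 1) = 2 pixels to the field.
def receptive_field(num_layers, kernel=3, stride=1):
    rf = 1
    jump = 1  # spacing of adjacent output positions, measured in input pixels
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field(3))   # 7   -> three 3x3 layers see a 7x7 patch
print(receptive_field(50))  # 101 -> a 50-layer plain stack sees only ~100x100

# Parameter comparison with C input and C output channels:
C = 64
print(3 * 3 * 3 * C * C)  # three 3x3 layers: 27 * C^2 weights
print(7 * 7 * C * C)      # one 7x7 layer:    49 * C^2 weights
```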
The one formula that solves 95% of shape bugs
Output spatial size:
(W − F + 2P) / S + 1

(Take the floor of the division when it isn't exact.)

Worked examples:
| Input W | Filter F | Stride S | Padding P | Output |
|---|---|---|---|---|
| 7 | 3 | 1 | 0 | 5 |
| 7 | 3 | 2 | 0 | 3 |
| 5 | 3 | 1 | 1 | 5 (size preserved) |
| 227 | 11 | 4 | 0 | 55 (AlexNet conv1) |
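A minimal helper implementing the formula, checked against the rows above:

```python
def conv_out(w, f, s, p):
    """Output spatial size for input width w, filter size f, stride s, padding p."""
    return (w - f + 2 * p) // s + 1

assert conv_out(7, 3, 1, 0) == 5
assert conv_out(7, 3, 2, 0) == 3
assert conv_out(5, 3, 1, 1) == 5       # size preserved
assert conv_out(227, 11, 4, 0) == 55   # AlexNet conv1
```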
CNN vs MLP vs Vision Transformer
The architectural primitives matter because they're the trade-off you're picking:
| Property | MLP | CNN | Vision Transformer |
|---|---|---|---|
| Input handling | Flatten 32×32×3 → 3072 | Keep 3D tensor | Split into 16×16 patches |
| Locality bias | None | Strong (small filters) | Weak (attention) |
| Translation equivariance | No | Yes (param sharing) | Approximate |
| Data efficiency | Low | High | Low (needs huge pretraining) |
CNNs win on small and medium data because the architecture itself encodes "nearby pixels matter" and "the same pattern repeats everywhere". ViTs only catch up past ~100M images of pretraining.
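A back-of-the-envelope illustration of the parameter-sharing point; the hidden width of 1024 is an arbitrary choice made only for comparison:

```python
import torch.nn as nn

# MLP: flatten a 32x32x3 image and connect every pixel to every hidden unit
mlp_first = nn.Linear(32 * 32 * 3, 1024)
print(sum(p.numel() for p in mlp_first.parameters()))   # ~3.1M parameters

# CNN: 64 shared 3x3 filters, reused at every spatial position
conv_first = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv_first.parameters()))  # 1,792 parameters (1,728 weights + 64 biases)
```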
Who should care
- Anyone learning ML. These eight primitives are load-bearing for understanding ResNet, U-Net, YOLO, the Stable Diffusion VAE encoder, and every diffusion-model conv block.
- Engineers debugging shapes. Output mismatch errors in PyTorch are ~95% padding/stride bugs. The formula above solves them in 10 seconds.
- Practitioners onboarding to vision. Knowing why 3×3, stride 1, padding 1 is the default saves a week of guessing.
- Researchers reading papers. Receptive-field arithmetic explains why dilated convs, strided convs, and global average pooling exist.
Limitations & gotchas
- CNNs are not rotation-invariant. They handle translation but a rotated cat looks like a different image — fix with augmentation.
- Receptive field grows slowly. A 50-layer plain 3×3 CNN reaches only ~100×100 in theory, less in practice. This is why dilated convs and pooling exist.
- Pooling is destructive. Bad for segmentation; modern nets often replace it with strided conv.
- Effective ≠ theoretical receptive field. The effective field is roughly Gaussian and much smaller than the theoretical one — not all input pixels contribute equally (Distill, 2019).
What's next
The frontier in 2026 is hybrid: ConvNeXt, MaxViT, and similar architectures keep the CNN inductive biases (locality, translation equivariance, parameter sharing) but bolt on attention for global context. They dominate vision-on-a-budget benchmarks. On phones, MobileNetV4 and EfficientNet-V2 still rule because the parameter-efficiency story above is still the cheapest way to ship vision to a device.
The eight primitives don't go away. They just get composed differently.
Sources: CS231n, CNN Explainer, Distill — Receptive Fields, original 16-box post by @tetsuoai.