TL;DR
A CNN doesn't see a picture. It sees a 3D tensor of numbers, slides small learnable filters across it, and stacks the results into feature maps that go from edges → textures → parts → objects. Eight primitives — tensors, filters, feature maps, stride, padding, channels, pooling, receptive fields — explain every modern vision architecture. Get those right and ResNet, U-Net, YOLO, and the Stable Diffusion encoder all stop being magic.
What sparked this
A viral X post by @tetsuoai distilled the entire CNN stack into 16 boxes. The framing is simple: forget architectures for a second; learn the eight primitives, and the architectures fall out for free. This article is the long-form version with the actual numbers.
Why this mental model matters
Most people who try to learn CNNs get stuck in two places: shape mismatches in PyTorch ("why is my tensor (1, 64, 28, 28)?") and hyperparameter cargo-culting ("why does everyone use 3×3 stride 1 padding 1?"). Both collapse to the same root cause — the engineer never internalized what each primitive does to the tensor. Once you can mentally trace a 224×224×3 input through every layer and predict the output shape, you can read any vision paper in a single pass.
The eight primitives, with numbers
1. Tensors
A single RGB image is a tensor of shape (H, W, 3). A mini-batch is 4D: (N, H, W, C) in channels-last frameworks, or (N, C, H, W) in PyTorch. CIFAR-10 images are 32×32×3; ImageNet crops are typically 224×224×3. The network never "sees pixels"; it sees this tensor.
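A minimal sketch of these shapes in PyTorch, which uses the channels-first (N, C, H, W) layout; the batch size of 16 is just an illustrative choice:

```python
import torch

# One CIFAR-10 image: 3 channels, 32x32 pixels (channels-first in PyTorch)
image = torch.randn(3, 32, 32)

# A mini-batch of 16 ImageNet-sized crops: (N, C, H, W)
batch = torch.randn(16, 3, 224, 224)

print(image.shape)  # torch.Size([3, 32, 32])
print(batch.shape)  # torch.Size([16, 3, 224, 224])
```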
2. Filters (kernels)
A filter is a small learnable tensor — usually 3×3×C. The ×C is critical: filters are full-depth along channels but local along width and height. A 3×3 filter on RGB is really a 3×3×3 = 27-weight slab.
3. Feature maps
Slide the filter, dot-product at every position, write a scalar. The output is a 2D map showing where that pattern fires. Apply K filters, get K stacked feature maps — that's your next tensor's depth.
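A sketch of primitives 2 and 3 together using PyTorch's nn.Conv2d; the 64-filter count is an illustrative choice, not a rule:

```python
import torch
import torch.nn as nn

# 64 learnable filters, each 3x3 across space and full-depth (3) across channels
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

# Weight tensor layout: (num_filters, in_channels, kH, kW) -> 64 x 3 x 3 x 3
print(conv.weight.shape)  # torch.Size([64, 3, 3, 3])

x = torch.randn(1, 3, 32, 32)  # one RGB image
y = conv(x)                    # 64 stacked feature maps
print(y.shape)                 # torch.Size([1, 64, 30, 30]) - no padding, so 32 -> 30
```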
4. Stride
Stride S is how many pixels the filter jumps each step. S=1 is a dense scan. S=2 halves the output along each spatial axis, which cuts output positions (and compute) by roughly 4×. Strided convs are the cheap downsample of choice in modern nets that drop pooling.
5. Padding
Zeros added around the border so the filter can sit on edge pixels. With P = (F−1)/2 and stride 1, output size equals input size. This is exactly why "3×3, stride 1, padding 1" is the most common conv setting on Earth — it preserves dimensions, so you can stack 50 of them without size-tracking headaches.
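A quick sketch of how stride and padding combine, again with nn.Conv2d and illustrative sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

# "3x3, stride 1, padding 1": output spatial size equals input size (32 -> 32)
same = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
print(same(x).shape)  # torch.Size([1, 64, 32, 32])

# Stride 2 as a cheap downsample: each spatial dimension is halved (32 -> 16)
down = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
print(down(x).shape)  # torch.Size([1, 64, 16, 16])
```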
6. Channels
Channels are feature detectors. RGB has 3 input channels. After a conv with 64 filters, you have 64 output channels — each one a different learned pattern. Don't confuse "channel depth" (third tensor axis) with "network depth" (number of layers); they share the word but mean different things.
7. Pooling
The standard is 2×2, stride 2, max. It throws away 75% of activations but keeps the strongest signal in each 2×2 window. Zero learnable parameters. Pooling is destructive — modern segmentation nets often skip it for strided conv instead.
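A sketch of the standard 2×2, stride-2 max pool in PyTorch:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # zero learnable parameters

x = torch.randn(1, 64, 28, 28)
y = pool(x)
print(y.shape)  # torch.Size([1, 64, 14, 14]) - 3 of every 4 activations discarded
```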
8. Receptive fields
How much of the original image one neuron "sees". Stack three 3×3 conv layers and a single neuron in layer 3 has an effective 7×7 view of the input. Three 3×3s cost 3×9 = 27 parameters; one 7×7 costs 49. That's 45% fewer params plus two extra non-linearities — the entire reason VGG-style 3×3 stacks beat older designs.
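A rough sketch checking both claims. The 27-vs-49 count in the text is per input/output-channel pair; with C channels on each side both totals scale by C², so the ratio is unchanged. The channel count below is an arbitrary illustration:

```python
# Receptive-field growth for a stack of stride-1 convs:
# each 3x3 layer adds (F - 1) = 2 pixels to the field.
def receptive_field(num_layers, kernel=3, stride=1):
    rf = 1
    jump = 1  # spacing of adjacent output positions, measured in input pixels
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field(3))   # 7   -> three 3x3 layers see a 7x7 patch
print(receptive_field(50))  # 101 -> a 50-layer plain stack sees only ~100x100

# Parameter comparison with C input and C output channels:
C = 64
print(3 * 3 * 3 * C * C)  # three 3x3 layers: 27 * C^2 weights
print(7 * 7 * C * C)      # one 7x7 layer:    49 * C^2 weights
```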
The one formula that solves 95% of shape bugs
Output spatial size:
(W − F + 2P) / S + 1

(Take the floor of the division when it isn't exact.)

Worked examples:
| Input W | Filter F | Stride S | Padding P | Output |
|---|---|---|---|---|
| 7 | 3 | 1 | 0 | 5 |
| 7 | 3 | 2 | 0 | 3 |
| 5 | 3 | 1 | 1 | 5 (size preserved) |
| 227 | 11 | 4 | 0 | 55 (AlexNet conv1) |
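A minimal helper implementing the formula, checked against the rows above:

```python
def conv_out(w, f, s, p):
    """Output spatial size for input width w, filter size f, stride s, padding p."""
    return (w - f + 2 * p) // s + 1

assert conv_out(7, 3, 1, 0) == 5
assert conv_out(7, 3, 2, 0) == 3
assert conv_out(5, 3, 1, 1) == 5       # size preserved
assert conv_out(227, 11, 4, 0) == 55   # AlexNet conv1
```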
CNN vs MLP vs Vision Transformer
The architectural primitives matter because they're the trade-off you're picking:
| Property | MLP | CNN | Vision Transformer |
|---|---|---|---|
| Input handling | Flatten 32×32×3 → 3072 | Keep 3D tensor | Split into 16×16 patches |
| Locality bias | None | Strong (small filters) | Weak (attention) |
| Translation equivariance | No | Yes (param sharing) | Approximate |
| Data efficiency | Low | High | Low (needs huge pretraining) |
CNNs win on small and medium data because the architecture itself encodes "nearby pixels matter" and "the same pattern repeats everywhere". ViTs only catch up past ~100M images of pretraining.
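A back-of-the-envelope illustration of the parameter-sharing point; the hidden width of 1024 is an arbitrary choice made only for comparison:

```python
import torch.nn as nn

# MLP: flatten a 32x32x3 image and connect every pixel to every hidden unit
mlp_first = nn.Linear(32 * 32 * 3, 1024)
print(sum(p.numel() for p in mlp_first.parameters()))   # ~3.1M parameters

# CNN: 64 shared 3x3 filters, reused at every spatial position
conv_first = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv_first.parameters()))  # 1,792 parameters (1,728 weights + 64 biases)
```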
Who should care
- Anyone learning ML. These eight primitives are load-bearing for understanding ResNet, U-Net, YOLO, the Stable Diffusion VAE encoder, and every diffusion-model conv block.
- Engineers debugging shapes. Output mismatch errors in PyTorch are ~95% padding/stride bugs. The formula above solves them in 10 seconds.
- Practitioners onboarding to vision. Knowing why 3×3, stride 1, padding 1 is the default saves a week of guessing.
- Researchers reading papers. Receptive-field arithmetic explains why dilated convs, strided convs, and global average pooling exist.
Limitations & gotchas
- CNNs are not rotation-invariant. They handle translation but a rotated cat looks like a different image — fix with augmentation.
- Receptive field grows slowly. A 50-layer plain 3×3 CNN reaches only ~100×100 in theory, less in practice. This is why dilated convs and pooling exist.
- Pooling is destructive. Bad for segmentation; modern nets often replace it with strided conv.
- Effective ≠ theoretical receptive field. The effective field is roughly Gaussian and much smaller than the theoretical one — not all input pixels contribute equally (Distill, 2019).
What's next
The frontier in 2026 is hybrid: ConvNeXt, MaxViT, and similar architectures keep the CNN inductive biases (locality, translation equivariance, parameter sharing) but bolt on attention for global context. They dominate vision-on-a-budget benchmarks. On phones, MobileNetV4 and EfficientNet-V2 still rule because the parameter-efficiency story above is still the cheapest way to ship vision to a device.
The eight primitives don't go away. They just get composed differently.
Sources: CS231n, CNN Explainer, Distill — Receptive Fields, original 16-box post by @tetsuoai.