The Speck in the Frame: Why Cutting a Picture Into Tiles Lets a Computer See What It Was Missing

A person on a far sidewalk, a deer at the road's edge, a drone two fields away — all there in the pixels, all invisible to the model. Tiling is the cheap fix.

A child stands on a far corner, half a block down. To your eye, plain as day. You run the footage through a perfectly good object detector and the box never appears. The pixels are there. The model just never looked closely enough. That gap — between what's in the image and what a machine reports — is where small objects go to die, and there's a simple, well-documented trick that brings a lot of them back.

The thing nobody tells you about image size

Most object detectors do something quietly destructive before they ever start looking: they shrink your image.

A typical detection model expects a fixed input, often something like 640 by 640 pixels. Your camera might shoot 4K, which is typically 3840 by 2160. So the frame gets squashed down to fit, by a factor of about six along its width. A car that filled 300 pixels of width is still about 50 pixels wide after the squash, which is plenty. But a pedestrian a block away, who only occupied 25 pixels to begin with, shrinks to about four. A deer at the tree line becomes a smudge. A distant drone becomes nothing.
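
The arithmetic is short enough to check by hand. Here is a minimal sketch of it, assuming the frame is scaled uniformly so its width fits the model input; the function name `size_after_resize` is made up for illustration and is not part of any library.

```python
def size_after_resize(object_px: float, frame_px: int, model_px: int = 640) -> float:
    """Extent of an object, in pixels, after the frame is scaled to the model input.

    Assumes a uniform scale of model_px / frame_px along the measured axis.
    """
    return object_px * model_px / frame_px


print(size_after_resize(300, 3840))  # the car: 50.0
print(size_after_resize(25, 3840))   # the pedestrian: a little over 4
```

Real pipelines may letterbox, crop, or stretch instead, so the exact numbers vary, but the order of magnitude does not.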

Here's the part people feel before they see a chart: the model isn't broken, and the object isn't absent. The detail was thrown away in the resize, before the network got a vote. You can stare at the original file and point right at the thing the computer swears isn't there.

Resolution versus field of view — the whole trade in one sentence

Every fixed-input detector is forced into the same bargain. You can show it a wide field of view at low effective resolution (the whole shrunk-down frame), or a narrow field of view at high resolution (a small crop at native pixels). You cannot have both at once, because the model's input is a fixed size.

Whole-frame detection picks "wide and low." That's great for big, near objects and terrible for small, far ones. The information that distinguishes a person from a fence post lives in fine detail, and fine detail is exactly what the downscale destroys.

Tiling picks "narrow and high" — many times over.

The tiling idea, plainly

Instead of feeding the model one shrunken copy of the whole scene, you cut the original full-resolution image into a grid of smaller tiles, say a 3-by-3 patchwork, and run the detector on each tile separately. Each tile is closer to the model's native input size, so it gets downscaled far less. On a 4K frame, a 3-by-3 tile is 1280 by 720 and shrinks by a factor of two instead of six; cut the tiles at 640 pixels and they aren't shrunk at all. That far-off pedestrian who was four or five pixels in the whole frame is now a dozen pixels in a 3-by-3 tile, or the full 25 in a native-size one. Suddenly the model has something to work with, and the box can appear.

Then you stitch the results back together: take the detections from every tile, translate their coordinates back into the full image, and merge overlaps so the same object isn't counted twice across a seam.
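
The coordinate translation is the part that is easy to get subtly wrong, so here is a minimal sketch of it. It assumes boxes are (x_min, y_min, x_max, y_max) tuples and that each tile remembers where its top-left corner sat in the full image; `tile_to_frame` and its `scale` parameter are hypothetical names, not taken from any tool.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def tile_to_frame(box: Box, tile_origin: Tuple[int, int], scale: float = 1.0) -> Box:
    """Map a box from tile coordinates back into full-frame coordinates.

    tile_origin is the tile's top-left corner in the full frame.
    scale is how much the tile was resized before inference
    (model pixels per original pixel); 1.0 means it was not resized.
    """
    ox, oy = tile_origin
    x0, y0, x1, y1 = box
    # Undo the resize first, then shift by the tile's position.
    return (x0 / scale + ox, y0 / scale + oy, x1 / scale + ox, y1 / scale + oy)


# A box found in the tile whose corner sits at (1280, 720), with no resizing.
print(tile_to_frame((10, 20, 30, 60), (1280, 720)))
# The same box if the tile had been halved in size before inference.
print(tile_to_frame((10, 20, 30, 60), (1280, 720), scale=0.5))
```

The order matters: undo the resize before adding the offset, or every box drifts toward the tile's corner.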

This approach is publicly documented under the name SAHI — Slicing Aided Hyper Inference, introduced in a 2022 paper by Akyon, Altinuc, and Temizel and released as open-source tooling. The clever, practical detail in that work: tiles are cut with a bit of overlap between neighbors. Without overlap, an object straddling a tile boundary gets sliced in half, and a half-object is often unrecognizable. The overlap means most objects land whole inside at least one tile.
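
To make the overlap concrete, here is a sketch of one way to lay out overlapping tiles. It is an illustration written for this article, not the SAHI implementation; the names `tile_starts` and `tile_windows` are hypothetical, and shifting the last tile back so it stays inside the frame is one choice among several.

```python
from typing import List, Tuple


def tile_starts(length: int, tile: int, overlap: float) -> List[int]:
    """Start offsets of tiles along one axis, with fractional overlap between neighbors."""
    if tile >= length:
        return [0]
    stride = max(1, round(tile * (1 - overlap)))
    starts = list(range(0, length - tile, stride))
    # The last tile is shifted back so it ends exactly at the frame edge.
    starts.append(length - tile)
    return starts


def tile_windows(width: int, height: int, tile_w: int, tile_h: int,
                 overlap: float = 0.2) -> List[Tuple[int, int, int, int]]:
    """All tile windows as (x_min, y_min, x_max, y_max), row by row."""
    return [(x, y, x + tile_w, y + tile_h)
            for y in tile_starts(height, tile_h, overlap)
            for x in tile_starts(width, tile_w, overlap)]


windows = tile_windows(3840, 2160, 640, 640, overlap=0.2)
print(len(windows))  # 32
print(windows[:2])   # [(0, 0, 640, 640), (512, 0, 1152, 640)]
```

With 640-pixel tiles and 20 percent overlap, neighbors share a 128-pixel strip, so an object narrower than that strip sits whole in at least one tile. The price is visible too: 32 tiles where a gap-free grid without overlap would need 24.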

A second trick from the same line of work is to run tiled inference and a normal full-frame pass, then fuse them. The tiles catch the small stuff; the full frame still handles the large objects that might be bigger than a single tile. You merge both sets of boxes — typically with non-maximum suppression, which collapses heavily overlapping detections of the same thing into one.
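
Non-maximum suppression sounds more exotic than it is. Below is a minimal greedy version, assuming all boxes are already in full-frame coordinates and ignoring class labels; production pipelines usually run it per class and use an optimized library routine, and the 0.5 threshold is only a common default, not a recommendation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections: List[Detection], iou_threshold: float = 0.5) -> List[Detection]:
    """Keep the highest-scoring box of each heavily overlapping group."""
    kept: List[Detection] = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, score))
    return kept


tile_hits = [((100, 100, 200, 200), 0.9), ((500, 500, 520, 540), 0.7)]
full_frame_hits = [((105, 102, 205, 198), 0.8)]  # same object as the first tile hit
merged = nms(tile_hits + full_frame_hits)
print(len(merged))  # 2
```

The two near-identical boxes collapse into the higher-scoring one, and the small box that only the tiles found survives untouched.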

Why this beats "just use a bigger model"

The honest appeal of tiling is that it needs no retraining. You take a detector you already have, off the shelf, and change only how you feed it images. The model's weights never move. That's why the technique spread fast in aerial imagery, satellite analysis, wildlife surveys, and traffic monitoring — anywhere the interesting objects are routinely small relative to the frame.

You could instead train a model to accept much larger inputs. But for a typical convolutional detector, compute and memory grow roughly with the number of input pixels, which is the square of the side length, so doubling resolution roughly quadruples the cost, and you hit hardware limits quickly. Tiling spends compute more surgically: you pay for extra passes, but each pass is cheap and standard-sized, and you can run tiles in parallel or skip ones you already know are empty, like open sky.
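
A rough way to put numbers on that, under the stated assumption that the cost of a pass is proportional to its input pixel count (real detectors only approximate this, and `relative_cost` is a made-up helper):

```python
def relative_cost(width: int, height: int, base_side: int = 640) -> float:
    """Cost of one pass relative to a base_side x base_side pass.

    Assumption: cost is proportional to the number of input pixels.
    """
    return (width * height) / (base_side * base_side)


print(relative_cost(640, 640))    # 1.0
print(relative_cost(1280, 1280))  # 4.0, double the side, four times the cost
print(relative_cost(3840, 2160))  # 20.25, one pass at native 4K
```

Under that assumption a single native 4K pass costs about as much as twenty standard ones, so tiling does not make the pixels themselves cheaper. What it changes is that no single pass is bigger than the model and the hardware were built for.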

The costs, stated honestly

Tiling is not free, and pretending otherwise would be spin.

More inference passes. Nine tiles plus a full frame is ten forward passes instead of one. That's slower and more expensive per image — a real problem for high-frame-rate, low-latency video, and the main reason you wouldn't tile everything by default.
Seam bookkeeping. Overlap and merging add code and edge cases. Get the overlap too small and you cut objects; too large and you waste compute and create duplicate detections to reconcile.
It only helps small objects. If everything in your scene is already big in the frame, tiling adds cost for nothing. The win scales with how small your targets are relative to the frame.
Tile-local blindness. A model looking at one tile loses the surrounding context. Usually fine for detection, occasionally not — a thing that only makes sense given its neighborhood can get misread in isolation.
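
The first two costs are easy to quantify before you commit. Here is a small sketch that counts forward passes for a given tile size and overlap, assuming square tiles laid out on a regular stride with the last tile pushed back to the frame edge; the helper names are hypothetical.

```python
import math


def tiles_along(length: int, tile: int, overlap: float) -> int:
    """Number of tiles needed to cover one axis at the given fractional overlap."""
    if tile >= length:
        return 1
    stride = max(1, round(tile * (1 - overlap)))
    return math.ceil((length - tile) / stride) + 1


def forward_passes(width: int, height: int, tile: int, overlap: float,
                   full_frame: bool = True) -> int:
    """Total passes per image: every tile, plus an optional full-frame pass."""
    tiles = tiles_along(width, tile, overlap) * tiles_along(height, tile, overlap)
    return tiles + (1 if full_frame else 0)


for overlap in (0.0, 0.2, 0.5):
    print(overlap, forward_passes(3840, 2160, 640, overlap))
# 0.0 25
# 0.2 33
# 0.5 67
```

On a 4K frame with 640-pixel tiles, going from no overlap to 50 percent overlap takes you from 25 passes to 67. That is the trade the overlap setting controls, and it is worth measuring against your latency budget rather than guessing.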

The takeaway

The lesson generalizes past any one tool. A model can only reason about the pixels it's actually shown, and the unglamorous preprocessing step — that quiet resize — often decides what's even possible downstream. When small or distant objects keep slipping through, the first question isn't "do I need a smarter model." It's "did the detail survive long enough to be seen at all."

Tiling answers that by refusing to throw the detail away. Cut the frame, keep the pixels, look closely, then put it back together. The speck on the far corner was always there. Sometimes seeing it is just a matter of not squinting at the whole street at once.