Comment by menaerus

1 day ago

Using transformers is not mutually exclusive with keeping other tools in the toolbox.

What about DINOv2 and DINOv3, the 1B- and 7B-parameter vision transformer models? This paper [1] suggests they give significant improvements over traditional YOLO-based object detection.

[1] https://arxiv.org/html/2509.20787v2
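
To make the backbone idea concrete, here is a minimal feature-extraction sketch. It assumes the torch.hub entry points published in the facebookresearch/dinov2 repo (dinov2_vitb14 etc.); the detection head itself is omitted, and the input tensor is a stand-in for a real normalized image:

    import torch

    # Load a frozen DINOv2 ViT-B/14 backbone via torch.hub
    # (weights are downloaded on first call).
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
    model.eval()

    # DINOv2 uses 14x14 patches, so input H and W must be multiples of 14.
    img = torch.randn(1, 3, 224, 224)  # stand-in for a normalized RGB image

    with torch.no_grad():
        feats = model.forward_features(img)

    cls_token = feats["x_norm_clstoken"]        # (1, 768) global image embedding
    patch_tokens = feats["x_norm_patchtokens"]  # (1, 256, 768): a 16x16 grid of patch features
    # A detection head (a DETR-style decoder, or even a simple conv head)
    # would consume patch_tokens reshaped to (1, 768, 16, 16).
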

Indeed, there are multiple attempts to combine self-attention and convolutions in hybrid architectures, and there is evidence this works very well and may have significant advantages over pure vision transformer models [2-3].
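
For a rough picture of what "convolutions plus self-attention" means, here is a toy block loosely in the spirit of CvT [2]. It is not the architecture from either reference, just an illustrative depthwise convolution for local mixing feeding global self-attention:

    import torch
    import torch.nn as nn

    class ConvAttentionBlock(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            # Depthwise conv mixes local neighborhoods before global attention.
            self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):  # x: (B, C, H, W) feature map
            b, c, h, w = x.shape
            x = x + self.local(x)             # local (convolutional) mixing
            t = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
            n = self.norm(t)
            a, _ = self.attn(n, n, n)         # global (self-attention) mixing
            t = t + a                         # residual connection
            return t.transpose(1, 2).reshape(b, c, h, w)

    block = ConvAttentionBlock()
    print(block(torch.randn(2, 256, 14, 14)).shape)  # torch.Size([2, 256, 14, 14])
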

IMO there is little reason to think transformers are (even today) the best architecture for every deep learning application. Perhaps if a mega-corp poured all its resources into some convolutional-transformer architecture, you'd get something better than the current vision transformer (ViT) models. But so much optimization work has already gone into training ViTs, and we clearly still haven't maxed out their capacity, so it makes sense to stick with them at scale.

That being said, ViTs are clearly still the best option today if you want something trained on a near-entire-internet corpus of image or video data.

[2] https://arxiv.org/abs/2103.15808

[3] https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=convo...