Comment by D-Machine

2 days ago

No, not at all. There is a transformer obsession that is quite possibly not supported by the actual facts (CNNs can still do just as well: https://arxiv.org/abs/2310.16764), and CNNs definitely remain preferable for smaller and more specialized tasks (e.g. computer vision on medical data).

If you get into more robust and/or specialized tasks (e.g. rotation-invariant computer vision, graph neural networks, models working on point-cloud data, etc.), then transformers are also not obviously the right choice (or even usable in the first place). So there are plenty of other useful architectures out there.
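
For the point-cloud case in particular, here is a toy sketch of the core architectural idea (a PointNet-style shared per-point MLP plus a symmetric max-pool, so the model cannot depend on point ordering; all dimensions are illustrative):

    import torch
    import torch.nn as nn

    # Shared per-point MLP + symmetric max-pool: the output cannot
    # depend on the (arbitrary) ordering of the input points.
    mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))

    pts = torch.randn(1, 1024, 3)              # (batch, points, xyz)
    feat = mlp(pts).max(dim=1).values          # (batch, 128)
    perm = pts[:, torch.randperm(1024)]        # shuffle point order
    assert torch.allclose(feat, mlp(perm).max(dim=1).values)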

Using transformers does not rule out keeping other tools in the toolbox.

What about vision transformer models like DINOv2 and DINOv3, at 1B and 7B parameters? This paper [1] suggests significant improvements over traditional YOLO-based object detection.

[1] https://arxiv.org/html/2509.20787v2
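
For anyone wanting to poke at these: the DINOv2 backbones are published as torch.hub entry points, so a quick feature-extraction sketch looks like the following (this assumes the public facebookresearch/dinov2 release; DINOv3 weights are distributed separately):

    import torch

    # Load the pre-trained DINOv2 ViT-B/14 backbone from torch.hub.
    model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
    model.eval()

    # Input spatial dims must be multiples of the patch size (14).
    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        feats = model(x)        # (1, 768) global image embedding
    print(feats.shape)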

  • Indeed, there are even multiple attempts to use both self-attention and convolutions in novel architectures, and there is evidence this works very well and may have significant advantages over pure vision transformer models [1, 2] (see the toy hybrid-block sketch after the references below).

    IMO there is little reason to think transformers are (even today) the best architecture for every deep learning application. Perhaps if a mega-corp poured all its resources into some convolutional transformer architecture, you'd get something better than the current vision transformer (ViT) models. But since so much optimization work has gone into training ViTs, and since we clearly still haven't maxed out their capacity, it makes sense to stick with them at scale.

    That being said, ViTs are currently the clear best choice if you want something pre-trained on a near-entire-internet of image or video data.

    [1] https://arxiv.org/abs/2103.15808

    [2] https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=convo...
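
    As a toy illustration of the hybrid idea mentioned above (loosely in the spirit of CvT [1]; the exact layer choices and dimensions are illustrative, not the paper's):

        import torch
        import torch.nn as nn

        class ConvAttnBlock(nn.Module):
            # Local mixing via a depthwise conv, then global mixing
            # via self-attention over the flattened feature map.
            def __init__(self, dim=64, heads=4):
                super().__init__()
                self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
                self.norm = nn.LayerNorm(dim)
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, x):                 # x: (B, C, H, W)
                x = x + self.local(x)             # convolutional (local) mixing
                B, C, H, W = x.shape
                t = x.flatten(2).transpose(1, 2)  # (B, H*W, C) tokens
                n = self.norm(t)
                t = t + self.attn(n, n, n)[0]     # attention (global) mixing
                return t.transpose(1, 2).reshape(B, C, H, W)

        x = torch.randn(2, 64, 16, 16)
        print(ConvAttnBlock()(x).shape)           # torch.Size([2, 64, 16, 16])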

Is there something I can read to get a better sense of what types of models are most suitable for which problems? All I hear about are transformers nowadays, but what are the types of problems for which transformers are the right architecture choice?

  • Just do some basic searches on e.g. Google Scholar for your task (e.g. "medical image segmentation", "point cloud segmentation", "graph neural networks", "timeseries classification", "forecasting") or task modification (e.g. "'rotation invariant' architecture"), sort by year, make sure to click on papers that have a large number of citations, and start reading. You will start to get a feel for the domains and specific areas where transformers are and are not clearly the best models. Or just ask e.g. ChatGPT Thinking with search enabled about these kinds of things (and then verify the answers by going to the actual papers).

    Also check HuggingFace and other model hubs and filter by task to see if any of these models are available in an easy-to-use format. Most research models, though, will only be available on GitHub somewhere, and in general you end up deciding between a vision transformer and the latest convolutional model (usually a ConvNeXt vX for some X).

    In practice, if you are working with the kind of data that is found online, and don't have a highly specialized type of data or problem, then today you almost always just want some pre-trained transformer.
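
    A minimal sketch of grabbing a pre-trained model from the Hub (the model IDs below are real public checkpoints, but substitute whatever your task filter surfaces; "cat.jpg" is a placeholder path):

        from transformers import pipeline

        # Same task, two architectures: a ViT and a ConvNeXt.
        vit = pipeline("image-classification", model="google/vit-base-patch16-224")
        cnn = pipeline("image-classification", model="facebook/convnext-base-224")

        # Any local image path or URL works here.
        print(vit("cat.jpg")[:3])
        print(cnn("cat.jpg")[:3])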

    But if you actually have to (pre)train a model from scratch on specialized data, in many cases you will not have enough data or resources to get the most out of a transformer, and some kind of older, simpler convolutional model will often give better performance at lower cost. Sometimes in these cases you don't even want a deep learner at all, and classic ML or plain algorithms are far superior. A good example is timeseries forecasting, where embarrassingly simple linear models blow overly-complicated and hugely expensive transformer models right out of the water (https://arxiv.org/abs/2205.13504).
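
    To make "embarrassingly simple" concrete: the linear baseline from that paper is essentially a single linear layer mapping the look-back window directly to the forecast horizon (window and horizon lengths here are illustrative):

        import torch
        import torch.nn as nn

        L, H = 96, 24                  # look-back window, forecast horizon
        model = nn.Linear(L, H)        # the entire "architecture"

        history = torch.randn(32, L)   # (batch, look-back) toy series
        forecast = model(history)      # (batch, horizon)
        print(forecast.shape)          # torch.Size([32, 24])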

    Oh, right, and unless TabPFNv2 (https://www.nature.com/articles/s41586-024-08328-6) makes sense for your use-case, you are still better off using gradient-boosted decision trees (e.g. XGBoost, LightGBM, or CatBoost) for tabular data.
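
    For tabular data that means something as plain as the following (toy data and default-ish hyperparameters; tune for real use):

        import numpy as np
        from sklearn.model_selection import train_test_split
        from xgboost import XGBClassifier

        # Toy tabular problem; in practice X is your feature matrix.
        rng = np.random.default_rng(0)
        X = rng.random((1000, 20))
        y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        clf = XGBClassifier(n_estimators=200, max_depth=4)
        clf.fit(X_tr, y_tr)
        print(clf.score(X_te, y_te))   # held-out accuracy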