Comment by yorwba
7 hours ago
The comparison is against general models that are explicitly fine-tuned. Specifically, they pre-train their own models on unlabeled in-domain images, take DINO models pre-trained on internet-scale general images, and then fine-tune both on a small number of labeled in-domain images.
The idea is to show that unsupervised pre-training on your target data, even if you don't have a lot of it, can beat transfer learning from a larger but less focused dataset. A sketch of the protocol follows.
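To make the two-arm setup concrete, here is a minimal sketch (not the paper's actual code) of the comparison described above, using PyTorch and the public DINO hub entry point. The in-domain checkpoint path, the dataset, the number of classes, and all hyperparameters are hypothetical placeholders; the only real API calls are `torch.hub.load` for DINO ViT-S/16 (whose embedding dimension is 384) and standard PyTorch training machinery.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Arm 1: backbone self-supervised pre-trained on unlabeled in-domain images
# (weights produced elsewhere, e.g. by running DINO-style training on your
# own data; the checkpoint path below is a hypothetical placeholder).
in_domain_backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
in_domain_backbone.load_state_dict(torch.load("in_domain_ssl_checkpoint.pth"))

# Arm 2: DINO backbone pre-trained on internet-scale general images,
# used with its released weights as-is.
general_backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")

def finetune(backbone, labeled_loader, num_classes, epochs=10, lr=1e-4):
    """Fine-tune a backbone plus a fresh linear head on a small labeled set."""
    model = nn.Sequential(backbone, nn.Linear(384, num_classes))  # 384 = ViT-S/16 dim
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in labeled_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

# Both arms get the identical small labeled in-domain set, so the only
# difference is where the pre-training data came from:
# labeled_loader = DataLoader(my_small_labeled_dataset, batch_size=32, shuffle=True)
# model_a = finetune(in_domain_backbone, labeled_loader, num_classes=10)
# model_b = finetune(general_backbone, labeled_loader, num_classes=10)
```

The key design point the comment is making: holding the fine-tuning step fixed isolates the pre-training data as the only variable, which is what lets the authors attribute the gap to in-domain versus internet-scale pre-training.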