Reading Note: Meta AI Open-Sources DINOv2: A New AI Method for Training High-Performance Computer Vision Models Based on Self-Supervised Learning

Paul Xiong
5 min read · Apr 19, 2023


original link:

Recent developments in AI make it possible to pretrain foundational computer vision models on massive datasets. These models hold considerable promise: general-purpose visual features, meaning features that work across image distributions and tasks without fine-tuning, could considerably simplify the use of images in any system. (Note: does any model, or Victor’s database, reach this level? I don’t think so…) This study demonstrates that current pretraining approaches, particularly self-supervised methods, can produce such features when trained on enough curated data from diverse sources. (Note: the curated data is the problem. How do you find it when none exists? We have seen SSL struggle with this in some domains, such as medical imaging.) Meta AI has unveiled DINOv2, which it describes as the first self-supervised learning method for training computer vision models that performs on par with or better than the standard approach used in the field.

These visual features are robust and perform well across domains without fine-tuning. They are produced by DINOv2 models and can be used directly with classifiers as simple as a linear layer on a variety of computer vision tasks. The models were pretrained on 142 million images without any labels or annotations. (Note: this rests on an assumption that the pretrained model already shares common “embeddings” with the newly fed data… Are the 142M images used to train the classification model? If not, how do they contribute to the pretrained model? I will update this part after reading the source code.)
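To make "classifiers as simple as a linear layer" concrete, here is a minimal sketch of a linear probe trained on frozen features. It is not DINOv2 code: the feature vectors are synthetic stand-ins for backbone embeddings, and the names `make_fake_features`, `train_linear_probe`, and `predict` are hypothetical. The point is only that the backbone is never updated; the probe alone is trained.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_fake_features(n_per_class, dim=8, seed=0):
    """Synthetic stand-in for frozen backbone embeddings (an assumption,
    not real DINOv2 output): two classes clustered around -1 and +1."""
    rng = random.Random(seed)
    feats, labels = [], []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            feats.append([center + rng.gauss(0, 0.5) for _ in range(dim)])
            labels.append(label)
    return feats, labels


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit one linear layer plus sigmoid with full-batch gradient descent.
    Only the probe's weights change; the features stay fixed."""
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(logit) - y  # gradient of log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


features, labels = make_fake_features(n_per_class=50)
weights, bias = train_linear_probe(features, labels)
accuracy = sum(predict(weights, bias, x) == y
               for x, y in zip(features, labels)) / len(labels)
print(f"train accuracy: {accuracy:.2f}")
```

In practice the feature vectors would come from a frozen pretrained backbone rather than a random generator, and the probe would be evaluated on held-out data.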

Because it does not require large volumes of labeled data, self-supervised learning, the same approach used to build state-of-the-art large language models for text, is a powerful and flexible way to train AI models. Like earlier self-supervised systems, models trained with the DINOv2 method need no metadata attached to the images in the training set. Think of it as being able to learn from any image, not only those that come with a predefined set of tags, alt text, or a caption.

Essential Characteristics

  • DINOv2 is a novel approach to building high-performance computer vision models using self-supervised learning.
  • DINOv2 enables unsupervised learning of high-quality visual features that can be used for both image-level and pixel-level tasks, including image classification, instance retrieval, video understanding, depth estimation, and more.
  • Self-supervised learning is the main draw: it lets DINOv2 serve as a generic, flexible backbone for many computer vision tasks and applications, with no fine-tuning required before applying it to a new domain.
  • Building a large-scale, highly curated, diverse training dataset is also an integral part of this work. The dataset contains 142 million images.
  • On the algorithmic side, the work adds techniques that stabilize the training of larger models, along with more efficient implementations that reduce memory use and compute requirements.
  • The researchers have also released the pretrained DINOv2 models. The release includes the pretraining code and recipe for Vision Transformer (ViT) models, with ViT checkpoints published on PyTorch Hub.
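Instance retrieval, one of the image-level tasks listed above, usually comes down to ranking a gallery by embedding similarity. The sketch below assumes each image has already been turned into a feature vector by a frozen backbone; the three-dimensional vectors and the names `gallery` and `retrieve` are made up for illustration and are not part of the DINOv2 release.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two embeddings."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, gallery, top_k=2):
    """Return the top_k (name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, emb))
              for name, emb in gallery.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for backbone features (hypothetical values).
gallery = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

results = retrieve(query, gallery)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because the features are used as-is, the same embeddings could feed a classifier and a retrieval index without retraining the backbone.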


  • Simple linear classifiers can take advantage of the high-performance features provided by DINOv2.
  • DINOv2’s flexibility makes it usable as a general-purpose backbone for many computer vision applications.
  • On depth estimation, the features substantially outperform state-of-the-art methods both in-domain and out-of-domain.
  • The backbone stays generic without fine-tuning, so the same features can be used for many tasks at once.
  • The DINOv2 model family performs on par with weakly supervised (WSL) features, a significant improvement over the previous state of the art in self-supervised learning (SSL).
  • The features produced by DINOv2 models are useful as-is and show strong out-of-distribution performance.
  • Because DINOv2 relies on self-supervision, it can learn from any collection of images. It can also learn features, such as depth estimation, that the current standard approach cannot.

Relying on human annotation of images is a bottleneck because it limits the data available for model training. In highly specialized fields, images can be extremely hard to label. For instance, it is difficult to train machine learning models on labeled cellular imaging because there are not enough experts to annotate cells at the necessary scale. Self-supervised training on microscopic cellular imagery opens the way to foundational cell imagery models and, by extension, to biological discovery, for example by making it easier to compare established treatments with new ones.

Training larger architectures is a key part of the effort, and these models need more data to improve performance. However, more data is not always available. Because no curated dataset was large enough to meet their needs, the researchers turned to a publicly available repository of crawled web data and built a pipeline, inspired by LASER, to select useful data from it. Two steps are crucial when constructing a large-scale pretraining dataset from such a source: discarding irrelevant images and balancing the dataset across concepts.
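The two curation steps named above, discarding unwanted images and balancing across concepts, can be illustrated with a toy sketch. This is not Meta's pipeline: it assumes every image already has an embedding and a concept label (for example from clustering), and the threshold, the per-concept cap, and the function names are all hypothetical. The greedy duplicate check is quadratic and is shown only for clarity.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_near_duplicates(items, threshold=0.98):
    """Greedy filter: keep an item only if it is not too similar
    to any item that has already been kept."""
    kept = []
    for item in items:
        if all(cosine(item["embedding"], k["embedding"]) < threshold
               for k in kept):
            kept.append(item)
    return kept


def balance_by_concept(items, max_per_concept):
    """Cap the number of items per concept so no concept dominates."""
    counts = defaultdict(int)
    balanced = []
    for item in items:
        if counts[item["concept"]] < max_per_concept:
            counts[item["concept"]] += 1
            balanced.append(item)
    return balanced


# Toy pool with made-up embeddings and concept labels.
pool = [
    {"id": "a", "concept": "dog", "embedding": [1.0, 0.0]},
    {"id": "b", "concept": "dog", "embedding": [0.999, 0.01]},  # near-duplicate of "a"
    {"id": "c", "concept": "dog", "embedding": [0.7, 0.7]},
    {"id": "d", "concept": "dog", "embedding": [0.0, 1.0]},
    {"id": "e", "concept": "boat", "embedding": [-1.0, 0.2]},
]

deduplicated = drop_near_duplicates(pool)
curated = balance_by_concept(deduplicated, max_per_concept=2)
curated_ids = [item["id"] for item in curated]
print(curated_ids)
```

At web scale the pairwise comparison would have to be replaced by an approximate nearest-neighbor index, and the concept labels would themselves have to be derived without human annotation.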

The next step is to use this model as a building block in a larger, more complex AI system that can interact with large language models. With a visual backbone that supplies rich information about images, such systems can reason about images more deeply than is possible when an image is described by a single text sentence.


Originally published at on April 19, 2023.


