Most computer vision datasets do not generalize well to the many tasks expected of today’s vision-based models. Creating image datasets is a laborious task, and the results carry multiple limitations, such as being restricted to a fixed set of object categories. To overcome these image-label constraints, OpenAI designed and developed a new neural network architecture, CLIP (Contrastive Language-Image Pretraining), built with the purpose of learning transferable visual models from natural language supervision.
OpenAI has already done commendable work in AI and deep learning, especially with GPT-2 and GPT-3, and CLIP is an extension of that same line of research. CLIP predicts which caption belongs with which image using a simple pre-trained model that is more robust and scalable: a state-of-the-art image-recognition method trained on a dataset of 400 million image-text pairs scraped from the internet. After pre-training, natural language can be used to reference the learned visual concepts, enabling zero-shot transfer to new tasks. This approach matches the performance of models benchmarked on a variety of existing vision datasets, covering tasks such as action recognition in videos, optical character recognition, geo-localization, and many more built on object classification.
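To make the zero-shot idea concrete, here is a minimal sketch (not OpenAI's code): once an image and a set of candidate captions have been embedded into a shared space, classification reduces to picking the caption whose embedding is most similar to the image's. The embeddings below are random stand-ins; in a real pipeline they would come from CLIP's image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_embedding, text_embeddings):
    """Return the index of the caption embedding most similar to the image."""
    image_embedding = l2_normalize(image_embedding)
    text_embeddings = l2_normalize(text_embeddings)
    similarities = text_embeddings @ image_embedding  # one cosine score per caption
    return int(np.argmax(similarities))

rng = np.random.default_rng(0)
captions = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
# Stand-in embeddings (assumption: 512-dim, like CLIP's output space).
text_embeddings = rng.normal(size=(3, 512))
# Fake an image embedding that sits close to the "cat" caption.
image_embedding = text_embeddings[1] + 0.1 * rng.normal(size=512)
print(captions[zero_shot_classify(image_embedding, text_embeddings)])
```

Note that no classifier is trained for these three labels; swapping in a different caption list changes the "classes" for free, which is exactly what makes the zero-shot transfer described above possible.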
CLIP builds on a body of existing work on learning visual representations from natural language supervision. The approach draws on modern architectures such as the Vision Transformer and text Transformers; on ICMLM, which explores masked language modelling; on VirTex, which applies autoregressive language modelling; and on ConVIRT, which studied the same contrastive objective CLIP uses, in the medical imaging domain.
Vision models have usually been trained by fitting an image feature extractor and then training a linear classifier on top of it to predict labels for the assigned task. CLIP takes a different and more effective approach: it trains an image encoder and a text encoder jointly, in parallel, to predict the correct image-text pairings within a training batch.
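The joint training objective can be sketched as follows. This is an illustrative NumPy version of the symmetric contrastive loss described in the CLIP paper, not OpenAI's implementation: within a batch, the N matching image-text pairs sit on the diagonal of an N×N similarity matrix, and cross-entropy is applied in both directions (image-to-text and text-to-image). The embeddings and the temperature value are stand-in assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.
    Row i / column i is a matching pair, so the targets are the diagonal."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (N, N) cosine similarities
    idx = np.arange(logits.shape[0])
    loss_img = -np.log(softmax(logits, axis=1)[idx, idx])  # image -> text
    loss_txt = -np.log(softmax(logits, axis=0)[idx, idx])  # text -> image
    return float((loss_img.mean() + loss_txt.mean()) / 2)

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 64))               # stand-in image embeddings
txt = img + 0.05 * rng.normal(size=(4, 64))  # well-aligned matching captions
print(clip_contrastive_loss(img, txt))       # low loss: pairs line up
```

Minimizing this loss pushes each image embedding toward its own caption and away from the other captions in the batch, which is how the two encoders end up in a shared space without any fixed label set.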
Though still under development and not yet free of errors, this technology shows promising potential and may help shape the future of text-to-image systems.