The article explores zero-shot learning, a machine learning technique that classifies unseen examples, focusing on zero-shot image classification. It discusses the mechanics of zero-shot image classification, implementation methods, benefits and challenges, practical applications, and future directions.
A machine learning technique known as “zero-shot learning” (ZSL) allows a model to identify or classify examples of a class that were not present during training. The goal of this method is to close the gap between the enormous number of classes that exist in the real world and the limited set of classes a model can actually be trained on.
One particular application of zero-shot learning is zero-shot classification, which focuses on classifying instances—including ones that are absent from the training set—into classes.
Zero-shot classification examples include:
- Text classification: assigning a sentence to topic or intent labels that were never part of its training targets (see the sketch below).
- Image classification: sorting images into categories the model was not explicitly trained to recognize.
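For instance, text classification can be run zero-shot. A minimal sketch using the Hugging Face zero-shot-classification pipeline (the facebook/bart-large-mnli checkpoint and the labels are illustrative choices, not requirements):
from transformers import pipeline
# NLI-based zero-shot text classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# Candidate labels are supplied at inference time; the model never saw
# them as training targets
result = classifier(
    "One day I will see the world.",
    candidate_labels=["travel", "cooking", "dancing"],
)
print(result["labels"][0], result["scores"][0])  # top label and its score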
Zero-shot image classification is a specific type of zero-shot classification applied to visual data. It allows models to classify images into categories they haven’t explicitly seen during training.
Key differences from traditional image classification:
- A traditional classifier predicts over a fixed label set defined by its labeled training data; a zero-shot classifier accepts arbitrary candidate labels at inference time.
- Zero-shot models lean on pretrained vision-language representations rather than per-class labeled examples.
- Adding a new category requires no retraining; you simply change the candidate labels (see the sketch below).
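A minimal sketch of the contrast, assuming torchvision is available for the traditional side (the models and labels are illustrative):
from torchvision.models import resnet50, ResNet50_Weights
from transformers import pipeline
# Traditional: the output head is fixed at training time
# (here, the 1,000 ImageNet classes baked into the checkpoint)
weights = ResNet50_Weights.DEFAULT
traditional_model = resnet50(weights=weights)
print(len(weights.meta["categories"]))  # 1000 classes; growing it means retraining
# Zero-shot: the label set is just an argument at inference time
zero_shot = pipeline(task="zero-shot-image-classification",
                     model="openai/clip-vit-large-patch14")
# zero_shot(image, candidate_labels=["fox", "red panda"])  # any labels, no retraining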
First, we need to install the dependencies:
!pip install -q "transformers[torch]" pillow
There are two main approaches to implementing zero-shot image classification: using the high-level pipeline API, or loading the model and processor manually. The pipeline approach comes first:
from transformers import pipeline
from PIL import Image
import requests
# Set up the pipeline
checkpoint = "openai/clipvitlargepatch14"
detector = pipeline(model=checkpoint, task="zeroshotimageclassification")
url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTuC7EJxlBGYl8-wwrJbUTHricImikrH2ylFQ&s"
image = Image.open(requests.get(url, stream=True).raw)
image
# Perform classification
predictions = detector(image, candidate_labels=["fox", "bear", "seagull", "owl"])
predictions
# Find the dictionary with the highest score
best_result = max(predictions, key=lambda x: x['score'])
# Print the label and score of the best result
print(f"Label with the best score: {best_result['label']}, Score: {best_result['score']}")
Running this prints the label with the highest score, along with the score itself.
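Under the hood, the pipeline compares the image against a short text prompt built from each candidate label (the default template is along the lines of "This is a photo of {}."). As a sketch, assuming the pipeline's hypothesis_template argument, the prompt can be customized:
# Optional variation: supply a custom prompt template for the labels
predictions = detector(
    image,
    candidate_labels=["fox", "bear", "seagull", "owl"],
    hypothesis_template="a wildlife photo of a {}",
)
The second approach bypasses the pipeline and calls the model and processor directly, which exposes the raw similarity logits: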
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
import torch
from PIL import Image
import requests
# Load model and processor
checkpoint = "openai/clipvitlargepatch14"
model = AutoModelForZeroShotImageClassification.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)
# Load an image
url = "https://unsplash.com/photos/xBRQfR2bqNI/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc4Mzg4ODEx&force=true&w=640"
image = Image.open(requests.get(url, stream=True).raw)
image
# Prepare inputs
candidate_labels = ["tree", "car", "bike", "cat"]
inputs = processor(images=image, text=candidate_labels, return_tensors="pt", padding=True)
# Perform inference
with torch.no_grad():
    outputs = model(**inputs)
# logits_per_image gives one score per candidate label for this image
logits = outputs.logits_per_image[0]
probs = logits.softmax(dim=-1).numpy()
# Process results
result = [
    {"score": float(score), "label": label}
    for score, label in sorted(zip(probs, candidate_labels), key=lambda x: -x[0])  # highest score first
]
print(result)
# Find the dictionary with the highest score
best_result = max(result, key=lambda x: x['score'])
# Print the label and score of the best result
print(f"Label with the best score: {best_result['label']}, Score: {best_result['score']}")
Zero-shot image classification, built on the broader idea of zero-shot learning, is a major development in computer vision and machine learning. By enabling models to classify images into previously unseen categories, it offers unprecedented flexibility and adaptability. Future research should yield even more capable and flexible systems that adapt easily to novel visual concepts, with the potential to reshape a wide range of sectors and applications.
Q. How does zero-shot image classification differ from traditional image classification?
A. Traditional image classification requires labeled examples for each class it can recognize, while zero-shot classification can categorize images into classes it hasn’t explicitly seen during training.
Q. How does zero-shot image classification work?
A. It uses multi-modal models trained on large datasets of images and text descriptions. These models learn to create aligned representations of visual and textual information, allowing them to match new images with textual descriptions of categories.
Q. What are the advantages of zero-shot image classification?
A. The key advantages include flexibility to classify into new categories without retraining, scalability to new domains, reduced dependency on labeled data, and the ability to use natural language for specifying categories.
Q. Are there limitations to zero-shot image classification?
A. Yes, some limitations include potentially lower accuracy compared to specialized models, difficulty with subtle distinctions between similar categories, potentially inherited biases, and higher computational requirements.
Q. What are some practical applications of zero-shot image classification?
A. Applications include content moderation, e-commerce product categorization, medical imaging for rare conditions, wildlife monitoring, and object recognition in robotics.