Grad-CAM is a powerful visualization technique originally designed for CNN architectures to highlight which parts of an image influence a neural network's decisions. Today, I'll show you how I've adapted Grad-CAM to work with an image-to-text transformer model, specifically the TrOCR model from Hugging Face.
Step 1: Token Generation from the Model
The first step involves generating tokens from our TrOCR model. These tokens are essentially the model's interpretation of the image in textual form, which we'll later use for gradient computation.
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image_path = "your_image_path.jpg"
image = Image.open(image_path).convert("RGB")

def get_generated_tokens(image, model, processor):
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_tokens = model.generate(pixel_values=pixel_values, max_length=50)
    return generated_tokens, pixel_values
The generated tokens will look something like tensor([[2, 14200, 2022, 2]]), where the token '2' is the special token representing the start and end of the sequence.
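As a quick sanity check (a minimal sketch assuming the image, model, and processor defined above), you can decode the generated tokens back into the predicted string:

generated_tokens, pixel_values = get_generated_tokens(image, model, processor)
print(generated_tokens)  # e.g. tensor([[2, 14200, 2022, 2]])

# Decode the token IDs into text, dropping the special start/end tokens
decoded_text = processor.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(decoded_text)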
Step 2: Layer Selection for Grad-CAM
Choosing the right layer is crucial because the effectiveness of Grad-CAM depends on capturing activations that correlate with the output predictions. In transformers, this is typically one of the final layers.
If you're unsure about the layer names and output shapes, simply print the model, or use the torchsummary library for detailed output shapes of each layer (see the quick sketch below).
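For example (a small inspection sketch, not specific to this article's setup), you can print the encoder or list its named submodules to find candidate hook points:

# Print the full module hierarchy of the ViT encoder
print(model.encoder)

# Or list just the module names to pick a layer to hook
for name, module in model.encoder.named_modules():
    print(name)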
For the above model, I've chosen the last layer of the ViT encoder.
layer_name = model.encoder.encoder.layer[-1].output
Note: Here I've used .output in layer_name because a Hugging Face layer can return a dictionary or tuple; if it's a plain torch module, just the name of the layer is good enough.
Step 3: Attaching Hooks to Capture Outputs and Gradients
We attach a forward hook to the selected layer to capture its outputs during the forward pass and retain them for computing gradients during the backward pass.
last_layer_output = None

def save_output(module, input, output):
    global last_layer_output
    last_layer_output = output
    output.retain_grad()

last_layer = layer_name
last_layer.register_forward_hook(save_output)
Step 4: Targeting Specific Tokens
We select specific tokens and compute how much each part of the input image contributed to predicting that token, providing insight into the model's decisions.
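Concretely (a minimal sketch assuming the hook above has already been registered), we run a forward pass with the generated tokens as decoder input and backpropagate from the logit of the token we want to explain:

token_index = 0  # which generated token to explain

# Forward pass with the already-generated tokens as decoder input
outputs = model(pixel_values=pixel_values, decoder_input_ids=generated_tokens[:, :-1], return_dict=True)

# Backward pass on the logit of the selected token
selected_logit = outputs.logits[0, token_index, generated_tokens[0, token_index]]
selected_logit.backward()

# The hook captured the encoder activations; their gradient is now populated
layer_output, grad = last_layer_output, last_layer_output.grad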
Step 5: Reshaping Layer Outputs
Transformers output activations in a different format than CNNs. We transform these to mimic CNN feature maps, enabling us to apply Grad-CAM effectively.
Understanding the output shape of the selected layer:
- (Batch_size, Tokens, Features or Channels) -> (1, 577, 768)
- Remove the first token, [CLS], if it is a ViT -> (1, 576, 768)
- If the feature map is square, which is true in this case -> (1, 24, 24, 768)
- Apply a transpose so the features come first, as in a CNN -> (1, 768, 24, 24)
def reshape_transform_vit_huggingface(x):
    activations = x[:, 1:, :]  # Remove the first token, which is used for classification in some architectures
    side_length = int(np.sqrt(activations.shape[1]))  # Assuming the feature map is square
    activations = activations.view(activations.shape[0], side_length, side_length, activations.shape[2])
    activations = activations.transpose(2, 3).transpose(1, 2)  # (B, H, W, C) -> (B, C, H, W)
    return activations
Step 6: Applying Grad-CAM
Finally, we apply the Grad-CAM algorithm to highlight the regions of the image that matter for each token. The algorithm uses the gradients of the target token with respect to the activations from our chosen layer, weighted and summed to create a heatmap.
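For reference, this is the standard Grad-CAM formulation (here $y^t$ is the logit of the target token, $A^k$ the $k$-th activation map of the chosen layer, and $Z$ the number of spatial positions):

$$\alpha_k^t = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^t}{\partial A_{ij}^k}, \qquad L^t_{\text{Grad-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha_k^t A^k\Big)$$

The code below follows this recipe: averaging the gradients over the spatial dimensions gives the weights, the weighted sum over channels gives the map, and ReLU keeps only the positive contributions.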
transform_output = reshape_transform_vit_huggingface(layer_output)
transform_grad = reshape_transform_vit_huggingface(grad)

# Step 1: Average the gradients across the spatial dimensions to get per-channel weights
weights = torch.mean(transform_grad, dim=(2, 3), keepdim=True)
# Step 2: Weighted combination of the activation maps
grad_cam = torch.sum(weights * transform_output, dim=1, keepdim=True)  # Sum over the feature maps
# Step 3: Apply ReLU
grad_cam = torch.relu(grad_cam)  # Only keep positive contributions
grad_cam = grad_cam.squeeze(0)  # Remove the batch dimension for visualization
# Step 4: Normalize (optional but helps visualization)
grad_cam = grad_cam / grad_cam.max()
print("Grad-CAM shape:", grad_cam.shape)

# Upsample the heatmap to the original image size
heatmap = torch.nn.functional.interpolate(grad_cam.unsqueeze(0), size=(image.size[1], image.size[0]), mode='bilinear', align_corners=False)
heatmap = heatmap.squeeze().detach().numpy()
Implementation in Python
Here's the complete Python code that accomplishes all of the above steps, using PyTorch, PIL for image handling, and matplotlib for visualization:
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

# Load the pre-trained processor and model
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Load and process the image
image_path = "00809.jpg"
image = Image.open(image_path).convert("RGB")

def reshape_transform_vit_huggingface(x):
    # Drop the [CLS] token and reshape to a CNN-style (B, C, H, W) feature map
    activations = x[:, 1:, :]
    side_length = int(np.sqrt(activations.shape[1]))
    activations = activations.view(activations.shape[0], side_length, side_length, activations.shape[2])
    activations = activations.transpose(2, 3).transpose(1, 2)
    return activations

def get_generated_tokens(image, model, processor):
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # Forward pass
    generated_tokens = model.generate(pixel_values=pixel_values, max_length=50)
    return generated_tokens, pixel_values

last_layer_output = None

def get_activations_and_gradient(pixel_values, model, processor, generated_tokens, layer_name, token_index=0):
    text = processor.decode(generated_tokens[0, token_index], skip_special_tokens=False)

    def save_output(module, input, output):
        global last_layer_output
        last_layer_output = output
        output.retain_grad()

    last_layer = layer_name
    last_layer.register_forward_hook(save_output)

    outputs = model(pixel_values=pixel_values, decoder_input_ids=generated_tokens[:, :-1], return_dict=True)
    # Backward pass on a particular logit
    selected_logit = outputs.logits[0, token_index, generated_tokens[0, token_index]]
    selected_logit.backward()
    return last_layer_output, last_layer_output.grad, text

def apply_gradcam(layer_output, grad, image, index, text):
    transform_output = reshape_transform_vit_huggingface(layer_output)
    transform_grad = reshape_transform_vit_huggingface(grad)

    # Step 1: Average the gradients across the spatial dimensions
    weights = torch.mean(transform_grad, dim=(2, 3), keepdim=True)
    # Step 2: Weighted combination of activation maps
    grad_cam = torch.sum(weights * transform_output, dim=1, keepdim=True)  # Sum over the feature maps
    # Step 3: Apply ReLU
    grad_cam = torch.relu(grad_cam)  # Only keep positive contributions
    grad_cam = grad_cam.squeeze(0)  # Remove the batch dimension for visualization
    # Step 4: Normalize (optional but helps visualization)
    grad_cam = grad_cam / grad_cam.max()
    print("Grad-CAM shape:", grad_cam.shape)

    # Upsample to the original image size and overlay as a heatmap
    heatmap = torch.nn.functional.interpolate(grad_cam.unsqueeze(0), size=(image.size[1], image.size[0]), mode='bilinear', align_corners=False)
    heatmap = heatmap.squeeze().detach().numpy()
    blended = Image.blend(image.convert('RGBA'), Image.fromarray((plt.cm.jet(heatmap) * 255).astype(np.uint8)).convert('RGBA'), alpha=0.5)
    blended.save(f"blended_image_{index}.png", format='PNG')
    return {f"{text}": f"blended_image_{index}.png"}

layer_name = model.encoder.encoder.layer[-1].output
generated_tokens, pixel_values = get_generated_tokens(image, model, processor)
print(generated_tokens)

for index, token in enumerate(generated_tokens[:, :-1].numpy().tolist()[0]):
    layer_output, grad, text = get_activations_and_gradient(pixel_values, model, processor, generated_tokens, layer_name, token_index=index)
    data = apply_gradcam(layer_output, grad, image, index, text)
    print(data)
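If you want to view the overlays inline instead of opening the saved files, a small optional snippet like this works (it relies only on the matplotlib and PIL imports already in the script and on the blended_image_{index}.png naming used above):

import glob

# Show every saved Grad-CAM overlay with its file name as the title
for path in sorted(glob.glob("blended_image_*.png")):
    plt.figure()
    plt.imshow(Image.open(path))
    plt.axis("off")
    plt.title(path)
plt.show()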
By adapting Grad-CAM for use with a transformer model, we can gain insight into which parts of the image the model focuses on while generating text. This technique can be extremely useful for debugging and improving model performance, particularly in applications like automated content description and OCR.
I hope you found this guide helpful. For more insights and discussions on technology and innovation, feel free to follow me on LinkedIn: www.linkedin.com/in/meetvpatel. I look forward to connecting with you!