Grad-CAM is a powerful visualization technique originally designed for CNN architectures to highlight which parts of an image influence a neural network's decisions. Today, I'll show you how I've adapted Grad-CAM to work with an image-to-text transformer model, specifically the TrOCR model from Hugging Face.
Step 1: Token Generation from the Model
The first step involves generating tokens from our TrOCR model. These tokens are essentially the model's interpretation of the image in textual form, which we'll later use for gradient computation.
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
image_path = "your_image_path.jpg"
image = Image.open(image_path).convert("RGB")
def get_generated_tokens(image, model, processor):
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_tokens = model.generate(pixel_values=pixel_values, max_length=50)
    return generated_tokens, pixel_values
The generated tokens will look something like tensor([[ 2, 14200, 2022, 2]]), where the '2' token is the special token representing the start and end of the sequence.
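If you want a quick sanity check of what those ids correspond to, you can decode them with the processor (a minimal sketch using the function defined above):

generated_tokens, pixel_values = get_generated_tokens(image, model, processor)
print(generated_tokens)
# skip_special_tokens drops the start/end '2' tokens and returns just the recognized text
print(processor.batch_decode(generated_tokens, skip_special_tokens=True)[0])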
Step 2: Layer Selection for Grad-CAM
Selecting the right layer is important because the effectiveness of Grad-CAM depends on capturing relevant activations that correlate with the output predictions. In transformers, this is generally one of the final layers.
If you're unsure about the layer and output shape, simply print the model or use the torchsummary library to get detailed output shapes for each layer.
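For example, printing the encoder lists the transformer blocks you can hook into; torchinfo (a commonly used alternative to torchsummary) can additionally report per-layer output shapes. The 384x384 input resolution below is my assumption for trocr-base:

# Inspect the encoder to see which layers are available to hook
print(model.encoder)

# Optionally, torchinfo can print per-layer output shapes
# from torchinfo import summary
# summary(model.encoder, input_size=(1, 3, 384, 384))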
For the above model, I've chosen the final layer of the ViT encoder.
layer_name = model.encoder.encoder.layer[-1].output
Note: Here I've used .output in the layer_name because a Hugging Face layer can return a dictionary or tuple; if it's a plain torch module, just the name of the layer is good enough.
Step 3: Attaching Hooks to Capture Outputs and Gradients
We attach a forward hook to the chosen layer to capture its outputs during the forward pass and retain them for computing gradients during the backward pass.
last_layer_output = None
def save_output(module, input, output):
    global last_layer_output
    last_layer_output = output
    output.retain_grad()

last_layer = layer_name
last_layer.register_forward_hook(save_output)
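One small practical note: register_forward_hook returns a handle, so if you run this repeatedly (for example once per token, as we do later) you may want to keep it around and remove the hook when you're done:

hook_handle = last_layer.register_forward_hook(save_output)
# ... forward and backward passes ...
hook_handle.remove()  # detach the hook once activations and gradients have been captured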
Step 4: Targeting Specific Tokens
We select specific tokens to compute how much each part of the input image contributed to predicting that token, providing insight into the model's decisions.
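Concretely, for a chosen token_index we re-run a forward pass with the generated ids as decoder inputs and backpropagate from that token's logit; this is exactly what the full code below does, shown here in isolation:

token_index = 1  # e.g. the first token after the start-of-sequence token
outputs = model(pixel_values=pixel_values, decoder_input_ids=generated_tokens[:, :-1], return_dict=True)
# Backpropagate from the logit of the chosen token so the hooked layer receives gradients
selected_logit = outputs.logits[0, token_index, generated_tokens[0, token_index]]
selected_logit.backward()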
Step 5: Reshaping Layer Outputs
Transformers output activations in a different format compared to CNNs. We reshape them to mimic CNN feature maps, enabling us to apply Grad-CAM effectively.
Understanding the output shape of the chosen layer:
- (Batch_size, Tokens, Features or Channels) -> (1, 577, 768)
- Remove the first [CLS] token if it's a ViT -> (1, 576, 768)
- If the feature map is square, which is true in this case -> (1, 24, 24, 768)
- Apply transpose, so the features dimension comes first, just like a CNN -> (1, 768, 24, 24)
def reshape_transform_vit_huggingface(x):
    activations = x[:, 1:, :]  # Remove the first token, which is used for classification in some architectures
    side_length = int(np.sqrt(activations.shape[1]))  # Assuming the feature map is square
    activations = activations.view(activations.shape[0], side_length, side_length, activations.shape[2])
    activations = activations.transpose(2, 3).transpose(1, 2)  # (B, H, W, C) -> (B, C, H, W)
    return activations
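As a quick check, feeding a dummy tensor with the encoder's output shape (577 tokens and 768 hidden dimensions, per the shapes listed above) should yield CNN-style feature maps:

dummy = torch.randn(1, 577, 768)  # (batch, tokens including [CLS], hidden size)
print(reshape_transform_vit_huggingface(dummy).shape)  # expected: torch.Size([1, 768, 24, 24])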
Step 6: Applying Grad-CAM
Finally, we apply the Grad-CAM algorithm to highlight the important regions of the image for each token. The algorithm uses the gradients of the target token with respect to the activations from our chosen layer, which are weighted and summed to create a heatmap.
# layer_output and grad come from the hooked layer after the backward pass (see the full code below)
transform_output = reshape_transform_vit_huggingface(layer_output)
transform_grad = reshape_transform_vit_huggingface(grad)
# Step 1: Average the gradients across the spatial dimensions to get the weights
weights = torch.mean(transform_grad, dim=(2, 3), keepdim=True)
# Step 2: Weighted combination of activation maps
grad_cam = torch.sum(weights * transform_output, dim=1, keepdim=True)  # Sum over the feature maps
# Step 3: Apply ReLU
grad_cam = torch.relu(grad_cam)  # Only keep positive contributions
grad_cam = grad_cam.squeeze(0)  # Remove batch dimension for visualization
# Step 4: Normalize (optional but helps in visualization)
grad_cam = grad_cam / grad_cam.max()
print("Grad-CAM shape:", grad_cam.shape)
heatmap = torch.nn.functional.interpolate(grad_cam.unsqueeze(0), size=(image.size[1], image.size[0]), mode='bilinear', align_corners=False)
heatmap = heatmap.squeeze().detach().numpy()
Implementation in Python
Here's the complete Python code that accomplishes all of the above steps using PyTorch, PIL for image handling, and matplotlib for visualization:
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

# Load the pre-trained processor and model
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
# Load and process the image
image_path = "00809.jpg"
image = Image.open(image_path).convert("RGB")
def reshape_transform_vit_huggingface(x):
    activations = x[:, 1:, :]  # Remove the [CLS] token
    side_length = int(np.sqrt(activations.shape[1]))  # Assuming the feature map is square
    activations = activations.view(activations.shape[0], side_length, side_length, activations.shape[2])
    activations = activations.transpose(2, 3).transpose(1, 2)  # (B, H, W, C) -> (B, C, H, W)
    return activations
def get_generated_tokens(image, model, processor):
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # Forward pass
    generated_tokens = model.generate(pixel_values=pixel_values, max_length=50)
    return generated_tokens, pixel_values
last_layer_output = None
def get_activations_and_gradient(pixel_values, model, processor, generated_tokens, layer_name, token_index=0):
    text = processor.decode(generated_tokens[0, token_index], skip_special_tokens=False)

    def save_output(module, input, output):
        global last_layer_output
        last_layer_output = output
        output.retain_grad()

    last_layer = layer_name
    last_layer.register_forward_hook(save_output)
    outputs = model(pixel_values=pixel_values, decoder_input_ids=generated_tokens[:, :-1], return_dict=True)
    # Backward pass on a specific logit
    selected_logit = outputs.logits[0, token_index, generated_tokens[0, token_index]]
    selected_logit.backward()
    return last_layer_output, last_layer_output.grad, text
def apply_gradcam(layer_output, grad, image, index, text):
    transform_output = reshape_transform_vit_huggingface(layer_output)
    transform_grad = reshape_transform_vit_huggingface(grad)
    # Step 1: Average the gradients across the spatial dimensions to get the weights
    weights = torch.mean(transform_grad, dim=(2, 3), keepdim=True)
    # Step 2: Weighted combination of activation maps
    grad_cam = torch.sum(weights * transform_output, dim=1, keepdim=True)  # Sum over the feature maps
    # Step 3: Apply ReLU
    grad_cam = torch.relu(grad_cam)  # Only keep positive contributions
    grad_cam = grad_cam.squeeze(0)  # Remove batch dimension for visualization
    # Step 4: Normalize (optional but helps in visualization)
    grad_cam = grad_cam / grad_cam.max()
    print("Grad-CAM shape:", grad_cam.shape)
    heatmap = torch.nn.functional.interpolate(grad_cam.unsqueeze(0), size=(image.size[1], image.size[0]), mode='bilinear', align_corners=False)
    heatmap = heatmap.squeeze().detach().numpy()
    blended = Image.blend(image.convert('RGBA'), Image.fromarray((plt.cm.jet(heatmap) * 255).astype(np.uint8)).convert('RGBA'), alpha=0.5)
    blended.save(f"blended_image_{index}.png", format='PNG')
    return {f"{text}": f"blended_image_{index}.png"}
layer_name = model.encoder.encoder.layer[-1].output
generated_tokens, pixel_values = get_generated_tokens(image, model, processor)
print(generated_tokens)
for index, token in enumerate(generated_tokens[:, :-1].numpy().tolist()[0]):
    layer_output, grad, text = get_activations_and_gradient(pixel_values, model, processor, generated_tokens, layer_name, token_index=index)
    data = apply_gradcam(layer_output, grad, image, index, text)
    print(data)
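If you'd rather view the overlays inline instead of opening the saved files, here is a minimal matplotlib sketch, assuming the blended_image_*.png files written above:

# Show each saved overlay next to the token it explains
num_tokens = generated_tokens.shape[1] - 1
fig, axes = plt.subplots(1, num_tokens, figsize=(5 * num_tokens, 5))
for idx, ax in enumerate(np.atleast_1d(axes)):
    ax.imshow(Image.open(f"blended_image_{idx}.png"))
    ax.set_title(processor.decode(generated_tokens[0, idx], skip_special_tokens=False))
    ax.axis("off")
plt.show()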
By adapting Grad-CAM for use with a transformer model, we can gain insight into which parts of the image the model focuses on when generating text. This method can be extremely useful for debugging and improving model performance, particularly in applications like automated content description and OCR.
I hope you found this guide helpful. For more insights and discussions on technology and innovation, feel free to follow me on LinkedIn: www.linkedin.com/in/meetvpatel. I look forward to connecting with you!