Another useful library option for preprocessing is OpenCV, as it provides an extended set of capabilities, including advanced manipulations like resizing, deskewing, and rotating. We will not dig deeper into OpenCV here beyond the small sketch below.
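For readers who want a starting point anyway, a minimal preprocessing sketch could look like the following (assuming the opencv-python package and a PIL page image as produced by pdf2image; deskewing and rotation are left out):

import cv2
import numpy as np

def preprocess_page(pil_image):
    # Convert the PIL image to a grayscale OpenCV array
    img = cv2.cvtColor(np.array(pil_image), cv2.COLOR_RGB2GRAY)
    # Upscale to give the OCR engine more pixels to work with
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    # Binarize with Otsu's method to separate text from background
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return img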
In this approach, we use Tesseract via 'pytesseract' for OCR on the PDF images. Tesseract's strengths include:
- Multilingual Support: It supports numerous languages, which allows it to process a diverse range of documents.
- Open-Source Advantage: As an open-source tool, Tesseract is a cost-effective solution for OCR with good quality.
- Customization Options: Tesseract's engine can be fine-tuned for specific documents or unusual fonts to improve performance (see the sketch after this list).
While Tesseract provides a solid baseline, cloud-based services like Azure Document Intelligence offer an alternative with potentially better performance, but can be costly.
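To make the language support and customization options above concrete, here is a minimal sketch (the image file name is hypothetical, and additional language packs such as 'deu' have to be installed alongside Tesseract):

import pytesseract
from PIL import Image

# Hypothetical scanned page image
page_image = Image.open("scanned_page.png")

# 'lang' selects one or more language models; '--psm 6' tells Tesseract
# to assume a single uniform block of text
text = pytesseract.image_to_string(page_image, lang="deu+eng", config="--psm 6")
print(text)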
import pandas as pd
import pytesseract
from tqdm import tqdm

def ocr_pdf(pdf_pages):
    """
    Perform OCR on each page of a PDF represented as images, extracting text and metadata.

    This function iterates over each page, applies OCR to extract text along with its
    positional metadata, and aggregates the results into a structured pandas DataFrame.

    Parameters:
    - pdf_pages (list of PIL.Image.Image): List of PDF pages converted into images.

    Returns:
    - pd.DataFrame: A DataFrame containing OCR-extracted text and metadata for each recognized element.
    """
    ocr_data = []  # Initialize a list to hold OCR results

    # Process each page for OCR
    for page_num, page in tqdm(enumerate(pdf_pages), desc="Processing OCR"):
        # Apply OCR to extract detailed information as a dictionary
        ocr_output = pytesseract.image_to_data(page, output_type='dict')
        # Iterate through OCR output, filtering out empty text elements
        for i, text in enumerate(ocr_output['text']):
            if text.strip():  # Ensure text is not empty
                # Append OCR data with metadata to the list
                ocr_data.append({
                    "level": ocr_output['level'][i],
                    "page_num": page_num,
                    "block_num": ocr_output['block_num'][i],
                    "par_num": ocr_output['par_num'][i],
                    "line_num": ocr_output['line_num'][i],
                    "word_num": ocr_output['word_num'][i],
                    "left": ocr_output['left'][i],
                    "top": ocr_output['top'][i],
                    "width": ocr_output['width'][i],
                    "height": ocr_output['height'][i],
                    "conf": ocr_output['conf'][i],
                    "text": text
                })

    # Convert OCR data into a DataFrame
    ocr_df = pd.DataFrame(ocr_data)
    return ocr_df
# Apply OCR to the preprocessed PDF pages and display the first few rows of the result
ocr_df = ocr_pdf(pdf_pages)
ocr_df.head()
For Named Entity Recognition (NER) we use spaCy. spaCy is a powerful Python NLP library designed for practical real-world applications. It has several key features that make it an excellent choice for NLP tasks:
- Easy Usage: spaCy is easy to use, efficient, and can handle large amounts of text.
- Pre-trained Models: It provides numerous pre-trained models to tackle various NLP tasks, including NER, POS tagging, or sentiment analysis.
- Evolving Ecosystem: Since its release 10 years ago, spaCy has built a large ecosystem with integrations to many other libraries and components, and it is steadily growing. Recently, spaCy also started incorporating Large Language Model (LLM) capabilities like ChatGPT into its pipelines.
- NER Capabilities: spaCy's NER feature includes pre-trained models for identifying and extracting predefined categories like names, organizations, and locations. These capabilities are available out of the box and are already pre-trained on large corpora. The NER model consists of a convolutional neural network (CNN) that uses various types of word-embedding features. Other models like BERT can still be fine-tuned to outperform the spaCy NER model (see the example below).
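As a minimal, self-contained illustration of these NER capabilities (the example sentence is made up; the model has to be downloaded once via 'python -m spacy download en_core_web_md'):

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("Barack Obama visited Microsoft in Seattle on 10 February 2024.")

# Each entity exposes its text, label, and character offsets in the input string;
# these start_char/end_char offsets are exactly what the mapping step below relies on
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)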
NER Implementation on OCR Text
We use a simple approach to get the entities from the text and map them back to the PDF document:
- Text Conversion: First, the OCR output is converted to plain text. Each word recognized by OCR is concatenated, with spaces inserted between them.
- NER Step: For NER we use the 'en_core_web_md' spaCy model, which offers good performance with reasonable computational effort. We do not fine-tune the model further.
- Entity Mapping: After the NER step, we have to trace the found entities back to the correct OCR words. We carry out this mapping using the char_start and char_end positions of each OCR word.
import numpy as np
import spacy

def apply_ner(ocr_df):
    """
    Apply Named Entity Recognition (NER) to OCR-processed text data.

    Parameters:
    - ocr_df (pd.DataFrame): DataFrame containing OCR-processed text.

    Returns:
    - pd.DataFrame: Enhanced DataFrame with NER annotations, including entity types and unique identifiers.
    """
    # Initialize columns for NER annotations
    ocr_df["ner_type"] = "O"   # Default type for non-entities
    ocr_df["ner_number"] = -1  # Default identifier for non-entities
    # Compute each word's character start and end positions within the concatenated
    # text (every word is followed by a single separating space)
    word_lengths = ocr_df["text"].str.len()
    ocr_df["char_start"] = (np.cumsum(word_lengths + 1) - (word_lengths + 1)).astype(int)
    ocr_df["char_end"] = (ocr_df["char_start"] + word_lengths).astype(int)

    # Concatenate all OCR text to form a single document for spaCy NER
    document_text = " ".join(ocr_df["text"])
    nlp = spacy.load("en_core_web_md")
    doc = nlp(document_text)  # Apply NER

    # Iterate over recognized entities and assign labels and IDs
    for ner_number, entity in enumerate(doc.ents):
        start_char = entity.start_char
        end_char = entity.end_char
        # Find OCR word indices corresponding to the entity's character positions
        start_idx_candidates = ocr_df.index[ocr_df["char_start"] <= start_char]
        end_idx_candidates = ocr_df.index[end_char <= ocr_df["char_end"]]
        if not start_idx_candidates.empty and not end_idx_candidates.empty:
            start_idx = start_idx_candidates[-1]  # Last index meeting the start condition
            end_idx = end_idx_candidates[0]       # First index meeting the end condition
            # Apply NER label and number to all rows within the entity's range
            ocr_df.loc[start_idx:end_idx, "ner_type"] = entity.label_
            ocr_df.loc[start_idx:end_idx, "ner_number"] = ner_number
    return ocr_df
# Apply NER to the OCR DataFrame and display the first few annotated entries
ocr_df_ner = apply_ner(ocr_df)
ocr_df_ner.head()
This approach comes with some challenges and tradeoffs:
- Spatial Challenges: Words that are physically far apart in the document appear next to each other in the resulting text, which may mislead the NER model into treating them as belonging together. Furthermore, single entities (e.g., dates formatted as “10.2.2024”) might get separated by a space. To improve this, the position information and distances could be incorporated into the NER step.
- Language Challenges: Our approach assumes English documents and therefore uses the English spaCy model. Language detection would be an option to help choose the correct spaCy language model (a small sketch follows below).
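As a rough sketch of that last idea (assuming the external langdetect package; the model mapping and the English fallback are illustrative choices):

from langdetect import detect
import spacy

# Illustrative mapping from detected ISO language codes to spaCy model names
SPACY_MODELS = {"en": "en_core_web_md", "de": "de_core_news_md", "fr": "fr_core_news_md"}

def load_spacy_model(document_text):
    lang = detect(document_text)  # e.g. 'en', 'de', 'fr'
    return spacy.load(SPACY_MODELS.get(lang, "en_core_web_md"))  # fall back to English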
Finally, we add visual bounding boxes to the PDFs to highlight the recognized entities. We use the PyMuPDF library to load the document and overlay the found entities with thin colored bounding boxes. Each entity type is assigned a unique color.
Some additional steps must be carried out to add the bounding boxes correctly:
- Coordinate Scaling: Since the original coordinates come from the PIL images, we have to scale them to align with the PyMuPDF format.
- Merging Consecutive Bounding Boxes: For consecutive words recognized as part of the same entity within a line, only one bounding box covering all words of the entity is drawn.
- PDF Adjustments: Our approach adjusts for internal geometry changes in some PDFs. These originate from the different ways content can be represented within a PDF.
import fitz  # PyMuPDF

def overlay_ner_to_pdf(ocr_df_ner, pdf_pages, pdf_input_path, pdf_output_path):
    """
    Modify the original PDF by drawing colored rectangles around the text of recognized entities.
    Each entity type is represented by a unique color for easy identification.

    Parameters:
    - ocr_df_ner (pd.DataFrame): DataFrame containing NER annotations from spaCy.
    - pdf_pages (list): List of images representing PDF pages, used for scaling calculations.
    - pdf_input_path (str): Path to the original PDF file.
    - pdf_output_path (str): Path where the annotated PDF will be saved.
    """
    # Define colors for the different entity types
    spacy_entity_colors = {
        'PERSON': (0.7, 0.25, 0.25),      # Dark red
        'NORP': (0.5, 0.5, 0.25),         # Olive
        'FAC': (0.5, 0.25, 0.5),          # Purple
        'ORG': (0.25, 0.5, 0.5),          # Teal
        'GPE': (0.25, 0.25, 0.75),        # Blue
        'LOC': (0.25, 0.75, 0.25),        # Green
        'PRODUCT': (0.75, 0.25, 0.75),    # Magenta
        'EVENT': (0.75, 0.75, 0.25),      # Yellow
        'WORK_OF_ART': (0.5, 0.5, 0.75),  # Light blue
        'LAW': (0.25, 0.75, 0.75),        # Cyan
        'LANGUAGE': (0.75, 0.5, 0.25),    # Orange
        'DATE': (0.75, 0.25, 0.25),       # Red
        'TIME': (0.25, 0.5, 0.25),        # Dark green
        'PERCENT': (0.5, 0.25, 0.25),     # Brown
        'MONEY': (0.25, 0.5, 0.75),       # Light blue
        'QUANTITY': (0.5, 0.75, 0.25),    # Lime
        'ORDINAL': (0.75, 0.5, 0.75),     # Pink
        'CARDINAL': (0.25, 0.25, 0.5)     # Navy
    }
    # Load the PDF file using PyMuPDF
    pdf_document = fitz.open(pdf_input_path)

    # Iterate over each group of entity words to overlay bounding boxes
    for _, group in ocr_df_ner.query("ner_type != 'O'").groupby(["ner_number", "page_num", "block_num", "par_num", "line_num"]):
        page = pdf_document.load_page(int(group['page_num'].iloc[0]))
        page.wrap_contents()  # Adjust for any internal geometry changes in the PDF

        # Scaling factors from image coordinates to PDF coordinates
        scale_x, scale_y = page.rect.width / pdf_pages[0].width, page.rect.height / pdf_pages[0].height

        # Calculate and scale bounding box coordinates
        x1, y1, x2, y2 = group['left'].iloc[0], group['top'].iloc[0], group['left'].iloc[-1] + group['width'].iloc[-1], group['top'].iloc[-1] + group['height'].iloc[-1]
        rect = fitz.Rect(x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)

        # Draw the rectangle on the PDF page
        page.draw_rect(rect, color=spacy_entity_colors[group["ner_type"].iloc[0]], width=0.5)

    # Save the modified PDF
    pdf_document.save(pdf_output_path)
    pdf_document.close()
# Apply entity highlighting to the PDF
overlay_ner_to_pdf(ocr_df_ner, pdf_pages, pdf_input_path, pdf_output_path)
from pdf2image import convert_from_path

# Display the first page of the annotated PDF in the output folder
pdf_pages = convert_from_path(pdf_output_path, dpi=300)
pdf_pages[0]