Another helpful library option for preprocessing is OpenCV, as it offers an expanded set of capabilities, including advanced manipulations like resizing, deskewing, and rotating. We will, however, not dig deeper into OpenCV here.
In this approach, we use Tesseract via ‘pytesseract’ for OCR on the PDF images. Tesseract’s features include:
- Multilingual Support: It supports multiple languages, which allows it to process a diverse range of documents.
- Open-Source Advantage: As an open-source tool, Tesseract is a cost-effective solution for OCR with good quality.
- Customization Opportunities: Tesseract’s engine can be fine-tuned for specific documents or unusual fonts to improve performance.
While Tesseract provides a solid baseline, cloud-based services like Azure Document Intelligence offer an alternative with potentially better performance, but can be costly.
def ocr_pdf(pdf_pages):
    """
    Perform OCR on each page of a PDF represented as images, extracting text and metadata.

    This function iterates over each page, applies OCR to extract text together with its
    positional metadata, and aggregates the results into a structured pandas DataFrame.

    Parameters:
    - pdf_pages (list of PIL.Image.Image): List of PDF pages converted into images.

    Returns:
    - pd.DataFrame: A DataFrame containing OCR-extracted text and metadata for each recognized element.
    """
    ocr_data = []  # Initialize a list to hold OCR results
    # Process each page for OCR
    for page_num, page in tqdm(enumerate(pdf_pages), desc="Processing OCR"):
        # Apply OCR to extract detailed information as a dictionary
        ocr_output = pytesseract.image_to_data(page, output_type='dict')
        # Iterate through the OCR output, filtering out empty text elements
        for i, text in enumerate(ocr_output['text']):
            if text.strip():  # Ensure text is not empty
                # Append OCR data with metadata to the list
                ocr_data.append({
                    "level": ocr_output['level'][i],
                    "page_num": page_num,
                    "block_num": ocr_output['block_num'][i],
                    "par_num": ocr_output['par_num'][i],
                    "line_num": ocr_output['line_num'][i],
                    "word_num": ocr_output['word_num'][i],
                    "left": ocr_output['left'][i],
                    "top": ocr_output['top'][i],
                    "width": ocr_output['width'][i],
                    "height": ocr_output['height'][i],
                    "conf": ocr_output['conf'][i],
                    "text": text
                })
    # Convert the OCR data into a DataFrame
    ocr_df = pd.DataFrame(ocr_data)
    return ocr_df
# Apply OCR to the preprocessed PDF pages and display the first few rows of the result
ocr_df = ocr_pdf(pdf_pages)
ocr_df.head()
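Tesseract also reports a confidence score for every recognized word (the conf column above). Before running NER it can pay off to drop low-confidence tokens. The snippet below is a minimal sketch on a hand-made stand-in DataFrame; the threshold of 40 is an assumption to tune, not a value used in the pipeline above.

```python
import pandas as pd

# Stand-in for the OCR DataFrame produced by ocr_pdf(); real values come from pytesseract
ocr_df = pd.DataFrame({
    "text": ["Invoice", "Nr.", "l2345", "Berlin"],
    "conf": [96, 91, 23, 88],  # Tesseract word confidences on a 0-100 scale
})

# Keep only words Tesseract is reasonably sure about (threshold is a tunable assumption)
filtered = ocr_df[ocr_df["conf"] > 40].reset_index(drop=True)
print(filtered["text"].tolist())  # the garbled "l2345" is dropped
```

A cutoff like this trades recall for precision: genuine but blurry words are lost along with the noise, so the right threshold depends on the scan quality.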
For Named Entity Recognition (NER) we use spaCy. spaCy is a powerful Python NLP library designed for practical real-world applications. It has several key features that make it an outstanding choice for NLP tasks:
- Simple Usage: spaCy is easy to use, efficient, and can handle large amounts of text.
- Pre-trained Models: It provides several pre-trained models to tackle various NLP tasks, including NER, POS tagging, and sentiment analysis.
- Evolving Ecosystem: Since its release 10 years ago, spaCy has built a large ecosystem with integrations to many other libraries and components, and it is steadily expanding. Recently, spaCy also started incorporating Large Language Model (LLM) capabilities like ChatGPT into its pipelines.
- NER Capabilities: spaCy’s NER feature includes pre-trained models for the identification and extraction of predefined categories like names, organizations, and locations. These capabilities are available out of the box and are already pre-trained on large corpora. The NER model consists of a CNN that uses various kinds of word-embedding features. Other models like BERT can, however, be fine-tuned to outperform the spaCy NER model.
NER Implementation on OCR Text
We use a straightforward approach to get the entities from the text and map them to the PDF document:
- Text Conversion: First, the OCR output is converted to plain text. Each word identified by OCR is concatenated, with spaces inserted between them.
- NER Step: For NER we use the ‘en_core_web_md’ spaCy model, which provides good performance at reasonable computational cost. We do not additionally fine-tune the model.
- Entity Mapping: After the NER step, we need to trace the found entities back to the correct OCR words. We perform this mapping using the char_start and char_end positions of each OCR word.
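The offset bookkeeping behind this mapping can be illustrated in isolation. The following sketch rebuilds char_start/char_end for a toy word list and maps a hypothetical entity span (here simply the word “Berlin”, standing in for a spaCy GPE hit) back to its OCR rows:

```python
import pandas as pd

# Toy stand-in for the OCR word list; the real one comes from ocr_pdf()
words = pd.DataFrame({"text": ["Alice", "visited", "Berlin", "yesterday"]})

# Each word starts after all previous words plus one joining space per word
lengths = words["text"].str.len()
words["char_start"] = (lengths + 1).cumsum().shift(fill_value=0)
words["char_end"] = words["char_start"] + lengths

document_text = " ".join(words["text"])  # "Alice visited Berlin yesterday"

# Hypothetical entity span, as spaCy would report it via start_char/end_char
start_char = document_text.index("Berlin")
end_char = start_char + len("Berlin")

# Words whose character span overlaps the entity span
hits = words[(words["char_start"] < end_char) & (words["char_end"] > start_char)]
print(hits["text"].tolist())  # → ['Berlin']
```

Because the concatenation uses exactly one space per word, the offsets computed here line up one-to-one with the character positions spaCy reports on document_text.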
def apply_ner(ocr_df):
    """
    Apply Named Entity Recognition (NER) to OCR-processed text data.

    Parameters:
    - ocr_df (pd.DataFrame): DataFrame containing OCR-processed text.

    Returns:
    - pd.DataFrame: Enhanced DataFrame with NER annotations, including entity types and unique identifiers.
    """
    # Initialize columns for NER annotations
    ocr_df["ner_type"] = "O"  # Default type for non-entity
    ocr_df["ner_number"] = -1  # Default identifier for non-entity
    # Compute character start and end positions for mapping entities back
    # (each word starts after all previous words plus one joining space per word)
    word_lengths = ocr_df["text"].str.len()
    ocr_df["char_start"] = (word_lengths + 1).cumsum().shift(fill_value=0).astype(int)
    ocr_df["char_end"] = ocr_df["char_start"] + word_lengths
    # Concatenate all OCR text to form a single document for spaCy NER
    document_text = " ".join(ocr_df["text"])
    nlp = spacy.load("en_core_web_md")
    doc = nlp(document_text)  # Apply NER
    # Iterate over recognized entities and assign labels and IDs
    for ner_number, entity in enumerate(doc.ents):
        start_char = entity.start_char
        end_char = entity.end_char
        # Find OCR text indices corresponding to the entity's character positions
        start_idx_candidates = ocr_df.index[(ocr_df["char_start"] <= start_char)]
        end_idx_candidates = ocr_df.index[(end_char <= ocr_df["char_end"])]
        if not start_idx_candidates.empty and not end_idx_candidates.empty:
            start_idx = start_idx_candidates[-1]  # Last index meeting the start condition
            end_idx = end_idx_candidates[0]  # First index meeting the end condition
            # Apply NER label and number to all rows within the entity's range
            ocr_df.loc[start_idx:end_idx, "ner_type"] = entity.label_
            ocr_df.loc[start_idx:end_idx, "ner_number"] = ner_number
    return ocr_df
# Apply NER to the OCR DataFrame and display the first few annotated entries
ocr_df_ner = apply_ner(ocr_df)
ocr_df_ner.head()
This approach presents some challenges and tradeoffs:
- Spatial Challenges: Words that are physically far apart in the document appear next to each other in the resulting text, which might mislead the NER model into treating them as belonging together. Additionally, single entities (e.g., dates formatted as “10.2.2024”) might get separated by a space. To improve this, the location information and distances could be incorporated into the NER step.
- Language Challenges: Our approach assumes English documents and therefore uses the English spaCy model. Language detection would be a possibility to help choose the correct spaCy language model.
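For the language challenge, the model choice could be driven by a language detector. The sketch below shows only the selection part, under the assumption of a detector (e.g. the langdetect package) that returns ISO 639-1 codes; the code-to-model mapping is our own illustrative choice, not part of the pipeline above.

```python
# Map detected ISO 639-1 language codes to spaCy model names (illustrative selection)
SPACY_MODELS = {
    "en": "en_core_web_md",
    "de": "de_core_news_md",
    "fr": "fr_core_news_md",
}

def pick_spacy_model(lang_code, default="en_core_web_md"):
    """Return the spaCy model name for a detected language, falling back to English."""
    return SPACY_MODELS.get(lang_code, default)

print(pick_spacy_model("de"))  # de_core_news_md
print(pick_spacy_model("nl"))  # unknown code falls back to en_core_web_md
```

The chosen model name would then replace the hard-coded "en_core_web_md" in apply_ner; each model must of course be downloaded beforehand via spacy download.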
Finally, we add visual bounding boxes to the PDFs to highlight recognized entities. We use the PyMuPDF library to load the document and overlay bounding boxes with thin colored lines over the found entities. Each entity type is assigned a unique color.
Some additional steps must also be carried out to add the bounding boxes correctly:
- Coordinate Scaling: Since the original coordinates come from the PIL images, we need to scale them to align with the PyMuPDF format.
- Combine Consecutive Bounding Boxes: For consecutive words identified as part of the same entity within a line, only one bounding box containing all words of the entity is drawn.
- PDF Adjustments: Our approach adjusts for internal geometry changes in some PDFs. These originate in the different representations of content within the PDF.
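The coordinate scaling in the first step is plain proportional arithmetic: pixel coordinates from the page images are mapped into PyMuPDF’s point-based page coordinates. A minimal sketch with made-up sizes (an A4 page rendered at 300 DPI is roughly 2480x3508 pixels, against a 595x842-point PDF page):

```python
def scale_box(x1, y1, x2, y2, img_w, img_h, page_w, page_h):
    """Map a pixel-space bounding box onto PDF page coordinates (points)."""
    scale_x, scale_y = page_w / img_w, page_h / img_h
    return (x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)

# A word box detected by OCR on a 2480x3508 px image, drawn on a 595x842 pt page
rect = scale_box(248, 350.8, 496, 701.6, img_w=2480, img_h=3508, page_w=595, page_h=842)
print(rect)  # roughly (59.5, 84.2, 119.0, 168.4)
```

The function below applies exactly this kind of scaling per page, deriving the factors from the page rectangle and the first page image.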
def overlay_ner_to_pdf(ocr_df_ner, pdf_pages, pdf_input_path, pdf_output_path):
    """
    Modify the original PDF by drawing colored rectangles around the text of recognized entities.
    Each entity type is represented by a unique color for easy identification.

    Parameters:
    - ocr_df_ner (pd.DataFrame): DataFrame containing NER annotations from spaCy.
    - pdf_pages (list): List of images representing PDF pages, used for scaling calculations.
    - pdf_input_path (str): Path to the original PDF file.
    - pdf_output_path (str): Path where the annotated PDF will be saved.
    """
    # Define colors for the different entity types
    spacy_entity_colors = {
        'PERSON': (0.7, 0.25, 0.25),      # Dark red
        'NORP': (0.5, 0.5, 0.25),         # Olive
        'FAC': (0.5, 0.25, 0.5),          # Purple
        'ORG': (0.25, 0.5, 0.5),          # Teal
        'GPE': (0.25, 0.25, 0.75),        # Blue
        'LOC': (0.25, 0.75, 0.25),        # Green
        'PRODUCT': (0.75, 0.25, 0.75),    # Magenta
        'EVENT': (0.75, 0.75, 0.25),      # Yellow
        'WORK_OF_ART': (0.5, 0.5, 0.75),  # Light blue
        'LAW': (0.25, 0.75, 0.75),        # Cyan
        'LANGUAGE': (0.75, 0.5, 0.25),    # Orange
        'DATE': (0.75, 0.25, 0.25),       # Red
        'TIME': (0.25, 0.5, 0.25),        # Dark green
        'PERCENT': (0.5, 0.25, 0.25),     # Brown
        'MONEY': (0.25, 0.5, 0.75),       # Soft blue
        'QUANTITY': (0.5, 0.75, 0.25),    # Lime
        'ORDINAL': (0.75, 0.5, 0.75),     # Pink
        'CARDINAL': (0.25, 0.25, 0.5)     # Navy
    }
    # Load the PDF file using PyMuPDF
    pdf_document = fitz.open(pdf_input_path)
    # Iterate over each group of entity words to overlay bounding boxes
    for _, group in ocr_df_ner.query("ner_type != 'O'").groupby(["ner_number", "page_num", "block_num", "par_num", "line_num"]):
        page = pdf_document.load_page(int(group['page_num'].iloc[0]))
        page.wrap_contents()  # Adjust for any internal geometry changes in the PDF
        # Scaling factors for the coordinates
        scale_x, scale_y = page.rect.width / pdf_pages[0].width, page.rect.height / pdf_pages[0].height
        # Calculate and scale the bounding box coordinates
        x1, y1, x2, y2 = group['left'].iloc[0], group['top'].iloc[0], group['left'].iloc[-1] + group['width'].iloc[-1], group['top'].iloc[-1] + group['height'].iloc[-1]
        rect = fitz.Rect(x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)
        # Draw the rectangle on the PDF page
        page.draw_rect(rect, color=spacy_entity_colors[group["ner_type"].iloc[0]], width=0.5)
    # Save the modified PDF
    pdf_document.save(pdf_output_path)
    pdf_document.close()
# Apply entity highlighting to the PDF
overlay_ner_to_pdf(ocr_df_ner, pdf_pages, pdf_input_path, pdf_output_path)
# Display the first page of the annotated PDF from the output folder
pdf_pages = convert_from_path(pdf_output_path, dpi=300)
pdf_pages[0]