From Chaos to Clarity: Unveiling the Power of Tomek Links | by Annisa Ditasila | Jul, 2024

Tomek Links (T-Links), introduced by Ivan Tomek in 1976, are pairs of instances from different classes that are each other's nearest neighbors. These pairs are identified based on a distance metric, typically Euclidean distance. They are used in machine learning, particularly for imbalanced datasets, to improve classification performance by reducing class overlap and noise.

Tomek Links help identify and remove borderline instances that may cause confusion during the learning process. Eliminating these pairs makes the dataset cleaner, which can improve the performance of classification algorithms.

Tomek Links are primarily used in imbalanced datasets where one class significantly outnumbers the other. They are common in fraud detection, medical diagnosis, and customer churn prediction, where it is crucial to reduce false positives and false negatives.

Example

Consider a binary classification problem with two classes, A and B. If an instance from class A and an instance from class B are each other's nearest neighbors, they form a Tomek Link. Removing these instances helps clarify the boundary between the two classes. In customer churn prediction, identifying and removing Tomek Links can help improve the accuracy of predicting whether a customer will leave or stay. By cleaning the dataset, the model can learn more distinct patterns associated with customer behavior.
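The definition above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `nearest_neighbor` and `find_tomek_links` and the toy points are made up for this example, Euclidean distance is assumed, and the brute-force search is not meant for large datasets.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))


def find_tomek_links(points, labels):
    """Index pairs (i, j), i < j, that are mutual nearest neighbors
    and carry different class labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


# Toy data, invented for illustration: points 2 and 3 sit right next to
# each other but belong to different classes.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (3.2, 3.1)]
labels = ["A", "A", "A", "B", "B", "B"]

links = find_tomek_links(points, labels)
print(links)  # [(2, 3)]
```

Pairs (0, 1) and (4, 5) are also mutual nearest neighbors, but they share a class, so only (2, 3) counts as a Tomek Link.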

Process

Identify Pairs: Find pairs of instances that are nearest neighbors but belong to different classes.
Evaluate Distance: Calculate the distance between the instances in each pair, using the chosen metric, to confirm that they are each other's nearest neighbors.
Remove Instances: Remove the instances that form the Tomek Link. This step reduces overlap and enhances class separation.

Tomek Links are primarily considered a data cleaning method rather than a resampling method, though they can serve a dual purpose depending on how they are used.

Data Cleaning Aspect: removing borderline instances, improving class separation
Resampling Aspect: minority class focus, combining with a resampling method (SMOTE, Random Oversampling, Random Undersampling)

Practical Use

In practice, Tomek Links are often used as part of a pipeline that includes both data cleaning and resampling steps.

# Libraries
import matplotlib.pyplot as plt
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification
from collections import Counter

# Create a dummy imbalanced dataset (about 90% class 0, 10% class 1)
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=1)
# Pipeline that applies a resampling method, then Tomek Links cleaning
def apply_resampling(X, y, resampler, title):
    pipeline = Pipeline([
        ('resampler', resampler),
        ('tl', TomekLinks())
    ])
    X_resampled, y_resampled = pipeline.fit_resample(X, y)
    return X_resampled, y_resampled, title
# Visualization: one scatter plot per dataset, stacked vertically
def plot_data(X_datasets, y_datasets, titles):
    num_datasets = len(X_datasets)
    fig, axes = plt.subplots(num_datasets, 1, figsize=(10, 10))
    for i in range(num_datasets):
        X_resampled, y_resampled = X_datasets[i], y_datasets[i]
        axes[i].scatter(X_resampled[y_resampled == 0][:, 0], X_resampled[y_resampled == 0][:, 1],
                        label='Class 0', alpha=0.5, edgecolor='k', color='blue')
        axes[i].scatter(X_resampled[y_resampled == 1][:, 0], X_resampled[y_resampled == 1][:, 1],
                        label='Class 1', alpha=0.5, edgecolor='k', color='pink')
        axes[i].set_title(titles[i])
        axes[i].legend()
    plt.tight_layout()
    plt.show()
# Apply resampling methods
X_datasets = []
y_datasets = []
titles = []
# Original Data
X_datasets.append(X)
y_datasets.append(y)
titles.append("Original Data")
# RandomUnderSampler + TomekLinks
X_resampled_rus, y_resampled_rus, title_rus = apply_resampling(X, y, RandomUnderSampler(random_state=42), "RandomUnderSampler + TomekLinks")
X_datasets.append(X_resampled_rus)
y_datasets.append(y_resampled_rus)
titles.append(title_rus)
# RandomOverSampler + TomekLinks
X_resampled_ros, y_resampled_ros, title_ros = apply_resampling(X, y, RandomOverSampler(random_state=42), "RandomOverSampler + TomekLinks")
X_datasets.append(X_resampled_ros)
y_datasets.append(y_resampled_ros)
titles.append(title_ros)
# SMOTE + TomekLinks
X_resampled_smote, y_resampled_smote, title_smote = apply_resampling(X, y, SMOTE(random_state=42), "SMOTE + TomekLinks")
X_datasets.append(X_resampled_smote)
y_datasets.append(y_resampled_smote)
titles.append(title_smote)
# Plot the data
plot_data(X_datasets, y_datasets, titles)
# Check the class distribution after each resampling method
print("Class Distribution:")
print(f"Original Data: {Counter(y)}")
print(f"After RandomUnderSampler + TomekLinks: {Counter(y_resampled_rus)}")
print(f"After RandomOverSampler + TomekLinks: {Counter(y_resampled_ros)}")
print(f"After SMOTE + TomekLinks: {Counter(y_resampled_smote)}")

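Output Comparison: [figure not reproduced: stacked scatter plots of the Original Data, RandomUnderSampler + TomekLinks, RandomOverSampler + TomekLinks, and SMOTE + TomekLinks]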
