Tomek Links (T-Links), introduced by Ivan Tomek in 1976, are pairs of instances from different classes that are each other's nearest neighbors. These pairs are identified using a distance metric, usually Euclidean distance. They are used in machine learning, particularly for imbalanced datasets, to improve classification performance by reducing class overlap and noise.
Tomek Links help identify and remove borderline instances that may cause confusion during the learning process. By eliminating these pairs, the dataset becomes cleaner, which can improve the performance of classification algorithms.
Tomek Links are primarily used in imbalanced datasets where one class significantly outnumbers the other. They are common in fraud detection, medical diagnosis, and customer churn prediction, where it is crucial to reduce false positives and false negatives.
Example
Consider a binary classification problem with two classes, A and B. If an instance from class A and an instance from class B are each other's nearest neighbors, they form a Tomek Link. Removing these instances helps clarify the boundary between the two classes. In customer churn prediction, identifying and removing Tomek Links can help improve the accuracy of predicting whether a customer will leave or stay. By cleaning the dataset, the model can learn more distinct patterns associated with customer behavior.
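To make the definition concrete, here is a minimal sketch in plain Python. The six one-dimensional points and their labels are made up for illustration, and the helper names `nearest_neighbor` and `find_tomek_links` are hypothetical, not part of any library.

```python
# Toy 1-D data: (feature value, class label). Values are invented for illustration.
points = [(1.0, "A"), (1.5, "A"), (2.9, "A"), (3.1, "B"), (4.5, "B"), (5.0, "B")]

def nearest_neighbor(i, data):
    """Return the index of the point closest to data[i], excluding itself."""
    return min(
        (j for j in range(len(data)) if j != i),
        key=lambda j: abs(data[j][0] - data[i][0]),
    )

def find_tomek_links(data):
    """Return index pairs (i, j) with i < j that are mutual nearest
    neighbors and carry different class labels."""
    links = []
    for i in range(len(data)):
        j = nearest_neighbor(i, data)
        is_mutual = nearest_neighbor(j, data) == i
        if i < j and is_mutual and data[i][1] != data[j][1]:
            links.append((i, j))
    return links

links = find_tomek_links(points)
print(links)  # [(2, 3)]
```

Only the pair at 2.9 (class A) and 3.1 (class B) qualifies. The other mutual nearest neighbors share a class, so they are not Tomek Links.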
Process
- Identify Pairs: Find pairs of instances that are each other's nearest neighbors but belong to different classes.
- Evaluate Distance: Calculate the distance between instances (usually Euclidean) to determine which ones are nearest neighbors.
- Remove Instances: Remove the instances that form the Tomek Link, either both members or, as is common with imbalanced data, only the majority-class member. This step reduces overlap and enhances class separation.
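The steps above can be sketched with NumPy as follows. This is an illustrative brute-force version under stated assumptions, not the implementation used by any library: the function names `tomek_link_mask` and `remove_tomek_links`, the `majority_only` flag, and the toy data are all hypothetical, and the pairwise distance matrix makes it suitable for small datasets only.

```python
import numpy as np

def tomek_link_mask(X, y):
    """Boolean mask marking every sample that belongs to a Tomek Link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Distance step: pairwise Euclidean distances, ignoring self-distance.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    # Pair step: nearest neighbor of each sample, then keep mutual pairs
    # whose labels differ.
    nn = dist.argmin(axis=1)
    mutual = nn[nn] == np.arange(len(X))
    different_class = y[nn] != y
    return mutual & different_class

def remove_tomek_links(X, y, majority_only=False):
    """Removal step: drop both members of each link, or only the
    majority-class member when majority_only is True."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    in_link = tomek_link_mask(X, y)
    if majority_only:
        labels, counts = np.unique(y, return_counts=True)
        majority = labels[counts.argmax()]
        in_link = in_link & (y == majority)
    keep = ~in_link
    return X[keep], y[keep]

# Invented toy data: samples 2 and 3 sit close together but differ in class.
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 1.0],
                  [3.0, 3.0], [3.2, 3.1], [0.0, 0.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 0])

X_both, y_both = remove_tomek_links(X_toy, y_toy)
X_major, y_major = remove_tomek_links(X_toy, y_toy, majority_only=True)
print(len(y_both), len(y_major))  # 5 6
```

Removing both members discards information from the minority class as well, which is why the majority-only variant is often preferred when the minority class is already small.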
Tomek Links are primarily considered a data cleaning method rather than a resampling method, though they can serve a dual purpose depending on how they are used.
- Data Cleaning Aspect: removing borderline instances, improving class separation
- Resampling Aspect: minority class focus, combining with resampling methods (SMOTE, Random Oversampling, Random Undersampling)
Practical Use
In practice, Tomek Links are often used as part of a pipeline that includes both data cleaning and resampling steps.
# LIBRARY
import matplotlib.pyplot as plt
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification
from collections import Counter

# Create dummy imbalanced dataset
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=1)
# Pipeline with different resampling methods
def apply_resampling(X, y, resampler, title):
    # Run the given resampler first, then clean the result with Tomek Links
    pipeline = Pipeline([
        ('resampler', resampler),
        ('tl', TomekLinks())
    ])
    X_resampled, y_resampled = pipeline.fit_resample(X, y)
    return X_resampled, y_resampled, title
# Visualization
def plot_data(X_datasets, y_datasets, titles):
    # One scatter plot per dataset, stacked vertically
    num_datasets = len(X_datasets)
    fig, axes = plt.subplots(num_datasets, 1, figsize=(10, 10))
    for i in range(num_datasets):
        X_resampled, y_resampled = X_datasets[i], y_datasets[i]
        axes[i].scatter(X_resampled[y_resampled == 0][:, 0], X_resampled[y_resampled == 0][:, 1],
                        label='Class 0', alpha=0.5, edgecolor='k', color='blue')
        axes[i].scatter(X_resampled[y_resampled == 1][:, 0], X_resampled[y_resampled == 1][:, 1],
                        label='Class 1', alpha=0.5, edgecolor='k', color='pink')
        axes[i].set_title(titles[i])
        axes[i].legend()
    plt.tight_layout()
    plt.show()
# Apply resampling methods
X_datasets = []
y_datasets = []
titles = []
# Original Data
X_datasets.append(X)
y_datasets.append(y)
titles.append("Original Data")
# RandomUnderSampler + TomekLinks
X_resampled_rus, y_resampled_rus, title_rus = apply_resampling(X, y, RandomUnderSampler(random_state=42), "RandomUnderSampler + TomekLinks")
X_datasets.append(X_resampled_rus)
y_datasets.append(y_resampled_rus)
titles.append(title_rus)
# RandomOverSampler + TomekLinks
X_resampled_ros, y_resampled_ros, title_ros = apply_resampling(X, y, RandomOverSampler(random_state=42), "RandomOverSampler + TomekLinks")
X_datasets.append(X_resampled_ros)
y_datasets.append(y_resampled_ros)
titles.append(title_ros)
# SMOTE + TomekLinks
X_resampled_smote, y_resampled_smote, title_smote = apply_resampling(X, y, SMOTE(random_state=42), "SMOTE + TomekLinks")
X_datasets.append(X_resampled_smote)
y_datasets.append(y_resampled_smote)
titles.append(title_smote)
# Plot the data
plot_data(X_datasets, y_datasets, titles)
# Check the class distribution after each resampling method
print("Class Distribution:")
print(f"After Original Data: {Counter(y)}")
print(f"After RandomUnderSampler + TomekLinks: {Counter(y_resampled_rus)}")
print(f"After RandomOverSampler + TomekLinks: {Counter(y_resampled_ros)}")
print(f"After SMOTE + TomekLinks: {Counter(y_resampled_smote)}")
Output Comparison: