Tomek Links (T-Links), introduced by Ivan Tomek in 1976, are pairs of instances from different classes that are each other's nearest neighbors. These pairs are identified using a distance metric, usually Euclidean distance. They're used in machine learning, particularly for imbalanced datasets, to improve classification performance by reducing class overlap and noise.
Tomek Links help identify and remove borderline instances that may cause confusion during the learning process. By eliminating these pairs, the dataset becomes cleaner, which improves the performance of classification algorithms.
Tomek Links are primarily used on imbalanced datasets where one class significantly outnumbers the other. They're common in fraud detection, medical diagnosis, and customer churn prediction, where it is important to reduce false positives and false negatives.
Example
Consider a binary classification problem with two classes, A and B. If an instance from class A and an instance from class B are each other's nearest neighbors, they form a Tomek Link. Removing these instances helps clarify the boundary between the two classes. In customer churn prediction, identifying and removing Tomek Links can help improve the accuracy of predicting whether a customer will leave or stay. By cleaning the dataset, the model can learn more distinct patterns related to customer behavior.
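To make the definition concrete, here is a minimal sketch on a made-up one-dimensional dataset. The five points and their labels are assumptions chosen purely for illustration; class A is encoded as 0 and class B as 1.

```python
import numpy as np

# Hypothetical toy data: five 1-D points, class A = 0 and class B = 1.
X = np.array([[0.0], [0.9], [2.0], [2.4], [4.0]])
y = np.array([0, 0, 0, 1, 1])

# Pairwise distances; the diagonal is set to infinity so that a point
# is never selected as its own nearest neighbor.
dist = np.abs(X - X.T)
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)

# A Tomek Link is a pair of mutual nearest neighbors with different labels.
links = [(i, int(j)) for i, j in enumerate(nearest)
         if i < j and nearest[j] == i and y[i] != y[j]]
print(links)  # [(2, 3)]
```

Points 0 and 1 are also mutual nearest neighbors, but they share a class, so they do not form a link. Only the pair at 2.0 (class A) and 2.4 (class B) sits on the class boundary.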
Process
- Identify Pairs: Find pairs of instances that are nearest neighbors but belong to different classes.
- Evaluate Distance: Calculate the distance between each pair.
- Remove Instances: Remove the instances that form the Tomek Link. This step reduces overlap and enhances class separation.
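The steps above can be sketched from scratch as follows. This is an illustrative version under stated assumptions, not the implementation used by any particular library: the helper names `find_tomek_links` and `remove_tomek_links` are hypothetical, Euclidean distance is assumed, and both members of every link are removed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def find_tomek_links(X, y):
    """Return an array of index pairs (i, j), i < j, that form Tomek Links."""
    # Calling kneighbors() without arguments queries the training points
    # themselves and excludes each point from its own neighbor list.
    nn = NearestNeighbors(n_neighbors=1).fit(X)
    nearest = nn.kneighbors(return_distance=False)[:, 0]

    idx = np.arange(len(X))
    mutual = nearest[nearest] == idx   # each is the other's nearest neighbor
    cross = y != y[nearest]            # the two labels differ
    mask = mutual & cross & (idx < nearest)  # report each pair once
    return np.column_stack([idx[mask], nearest[mask]])


def remove_tomek_links(X, y):
    """Drop both members of every Tomek Link (hypothetical helper)."""
    links = find_tomek_links(X, y)
    keep = np.ones(len(X), dtype=bool)
    keep[links.ravel()] = False
    return X[keep], y[keep], links


# Made-up 2-D points for illustration only.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.2],
                  [1.3, 0.0], [3.0, 3.0], [3.5, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

X_clean, y_clean, links = remove_tomek_links(X_toy, y_toy)
print(links.tolist())    # [[1, 3]]
print(y_clean.tolist())  # [0, 0, 1, 1]
```

Removing both members is one design choice; on imbalanced data it is also common to remove only the majority-class member so that no minority samples are lost.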
Tomek Links are primarily considered a data cleaning technique rather than a resampling technique, though they can serve a dual purpose depending on how they're used.
- Data Cleaning Aspect: removing borderline instances, improving class separation
- Resampling Aspect: minority class focus, combining with resampling techniques (SMOTE, Random Oversampling, Random Undersampling)
Practical Use
In practice, Tomek Links are often used as part of a pipeline that includes both data cleaning and resampling steps.
# LIBRARY
import matplotlib.pyplot as plt
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification
from collections import Counter

# Create dummy imbalanced dataset
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=1)

# Pipeline with different resampling techniques:
# the given resampler runs first, then TomekLinks cleans the result
def apply_resampling(X, y, resampler, title):
    pipeline = Pipeline([
        ('resampler', resampler),
        ('tl', TomekLinks())
    ])
    X_resampled, y_resampled = pipeline.fit_resample(X, y)
    return X_resampled, y_resampled, title

# Visualization: one scatter plot per dataset, stacked vertically
def plot_data(X_datasets, y_datasets, titles):
    num_datasets = len(X_datasets)
    fig, axes = plt.subplots(num_datasets, 1, figsize=(10, 10))
    for i in range(num_datasets):
        X_resampled, y_resampled = X_datasets[i], y_datasets[i]
        axes[i].scatter(X_resampled[y_resampled == 0][:, 0], X_resampled[y_resampled == 0][:, 1],
                        label='Class 0', alpha=0.5, edgecolor='k', color='blue')
        axes[i].scatter(X_resampled[y_resampled == 1][:, 0], X_resampled[y_resampled == 1][:, 1],
                        label='Class 1', alpha=0.5, edgecolor='k', color='pink')
        axes[i].set_title(titles[i])
        axes[i].legend()
    plt.tight_layout()
    plt.show()

# Apply resampling methods
X_datasets = []
y_datasets = []
titles = []
# Original Data
X_datasets.append(X)
y_datasets.append(y)
titles.append("Original Data")
# RandomUnderSampler + TomekLinks
X_resampled_rus, y_resampled_rus, title_rus = apply_resampling(X, y, RandomUnderSampler(random_state=42), "RandomUnderSampler + TomekLinks")
X_datasets.append(X_resampled_rus)
y_datasets.append(y_resampled_rus)
titles.append(title_rus)
# RandomOverSampler + TomekLinks
X_resampled_ros, y_resampled_ros, title_ros = apply_resampling(X, y, RandomOverSampler(random_state=42), "RandomOverSampler + TomekLinks")
X_datasets.append(X_resampled_ros)
y_datasets.append(y_resampled_ros)
titles.append(title_ros)
# SMOTE + TomekLinks
X_resampled_smote, y_resampled_smote, title_smote = apply_resampling(X, y, SMOTE(random_state=42), "SMOTE + TomekLinks")
X_datasets.append(X_resampled_smote)
y_datasets.append(y_resampled_smote)
titles.append(title_smote)
# Plot the data
plot_data(X_datasets, y_datasets, titles)
# Check the class distribution after each resampling technique
print("Class Distribution:")
print(f"Original Data: {Counter(y)}")
print(f"After RandomUnderSampler + TomekLinks: {Counter(y_resampled_rus)}")
print(f"After RandomOverSampler + TomekLinks: {Counter(y_resampled_ros)}")
print(f"After SMOTE + TomekLinks: {Counter(y_resampled_smote)}")
Output Comparison: