From Chaos to Clarity: Unveiling the Power of Tomek Links | by Annisa Ditasila | Jul, 2024

Tomek Links (T-Links), introduced by Ivan Tomek in 1976, are pairs of instances from different classes that are each other's nearest neighbors. These pairs are identified using a distance metric, usually Euclidean distance. They are used in machine learning, particularly on imbalanced datasets, to improve classification performance by reducing class overlap and noise.

Tomek Links help identify and remove borderline instances that can cause confusion during the learning process. By eliminating these pairs, the dataset becomes cleaner, which improves the performance of classification algorithms.

Tomek Links are primarily used on imbalanced datasets where one class significantly outnumbers the other. They are common in fraud detection, medical diagnosis, and customer churn prediction, where it is important to reduce false positives and false negatives.

Example

Consider a binary classification problem with two classes, A and B. If an instance from class A and an instance from class B are each other's nearest neighbors, they form a Tomek Link. Removing these instances helps clarify the boundary between the two classes. In customer churn prediction, identifying and removing Tomek Links can help improve the accuracy of predicting whether a customer will leave or stay. By cleaning the dataset, the model can learn more distinct patterns related to customer behavior.
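To make the definition concrete, here is a minimal sketch on a hypothetical one-dimensional toy dataset (the points and labels are made up for illustration and are not from the article). It checks which pairs are mutual nearest neighbors with different labels.

```python
import numpy as np

# Hypothetical toy data: five points on a line, labeled A or B.
X = np.array([[0.0], [1.5], [2.0], [2.4], [4.0]])
y = np.array(["A", "A", "A", "B", "B"])

# Pairwise distances; the diagonal is set to infinity so that a point
# is never selected as its own nearest neighbor.
dist = np.abs(X - X.T)
np.fill_diagonal(dist, np.inf)
nearest = dist.argmin(axis=1)

# A Tomek Link is a mutual nearest-neighbor pair with different labels.
links = [
    (i, int(j))
    for i, j in enumerate(nearest)
    if i < j and nearest[j] == i and y[i] != y[j]
]
print(links)  # [(2, 3)]
```

Only the points at 2.0 (class A) and 2.4 (class B) are each other's nearest neighbors across the class boundary, so they form the single Tomek Link in this toy set.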

Process

Identify Pairs: Find pairs of instances that are each other's nearest neighbors but belong to different classes.
Evaluate Distance: Calculate the distance between each pair.
Remove Instances: Remove the instances that form the Tomek Link. This step reduces overlap and improves class separation.
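The three steps above can be sketched in plain NumPy. This is an illustrative implementation under simple assumptions (Euclidean distance, a dense pairwise distance matrix, both members of each link removed), not the code used by any particular library. The function names `find_tomek_links` and `remove_tomek_links` and the demo data are hypothetical.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j) that form Tomek Links (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Identify pairs / evaluate distance: full pairwise distance matrix,
    # then the nearest neighbor of every point.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = dist.argmin(axis=1)
    # Keep mutual nearest neighbors that carry different labels.
    return [
        (i, int(j))
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]

def remove_tomek_links(X, y):
    """Remove instances: drop both members of every Tomek Link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.ones(len(y), dtype=bool)
    for i, j in find_tomek_links(X, y):
        keep[i] = keep[j] = False
    return X[keep], y[keep]

# Hypothetical demo data: one class-0 point and one class-1 point sit
# close together near the boundary; the rest are well separated.
X_demo = np.array([[0.0, 0.0], [0.8, 0.0], [2.0, 0.0],
                   [2.3, 0.0], [4.0, 0.0], [5.0, 0.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

X_clean, y_clean = remove_tomek_links(X_demo, y_demo)
print(find_tomek_links(X_demo, y_demo))  # [(2, 3)]
print(y_clean)                           # [0 0 1 1]
```

The dense distance matrix needs memory proportional to the square of the number of samples, so this sketch is only suitable for small datasets. A nearest-neighbor index would be the usual choice for larger ones.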

Tomek Links are primarily considered a data cleaning technique rather than a resampling technique, though they can serve a dual purpose depending on how they are used.

Data Cleaning Aspect: removing borderline instances, improving class separation
Resampling Aspect: minority class focus, combining with resampling techniques (SMOTE, Random Oversampling, Random Undersampling)

Practical Use

In practice, Tomek Links are often used as part of a pipeline that includes both data cleaning and resampling steps.

# LIBRARY
import matplotlib.pyplot as plt
import numpy as np
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification
from collections import Counter

# Create dummy imbalanced dataset
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=1)

# Pipeline that applies a resampler first, then Tomek Links cleaning
def apply_resampling(X, y, resampler, title):
    pipeline = Pipeline([
        ('resampler', resampler),
        ('tl', TomekLinks())
    ])
    X_resampled, y_resampled = pipeline.fit_resample(X, y)
    return X_resampled, y_resampled, title

# Visualization: one scatter plot per dataset, stacked vertically
def plot_data(X_datasets, y_datasets, titles):
    num_datasets = len(X_datasets)
    fig, axes = plt.subplots(num_datasets, 1, figsize=(10, 10))
    for i in range(num_datasets):
        X_resampled, y_resampled = X_datasets[i], y_datasets[i]
        axes[i].scatter(X_resampled[y_resampled == 0][:, 0], X_resampled[y_resampled == 0][:, 1],
                        label='Class 0', alpha=0.5, edgecolor='k', color='blue')
        axes[i].scatter(X_resampled[y_resampled == 1][:, 0], X_resampled[y_resampled == 1][:, 1],
                        label='Class 1', alpha=0.5, edgecolor='k', color='red')
        axes[i].set_title(titles[i])
        axes[i].legend()
    plt.tight_layout()
    plt.show()

# Apply resampling techniques
X_datasets = []
y_datasets = []
titles = []

# Original Data
X_datasets.append(X)
y_datasets.append(y)
titles.append("Original Data")

# RandomUnderSampler + TomekLinks
X_resampled_rus, y_resampled_rus, title_rus = apply_resampling(X, y, RandomUnderSampler(random_state=42), "RandomUnderSampler + TomekLinks")
X_datasets.append(X_resampled_rus)
y_datasets.append(y_resampled_rus)
titles.append(title_rus)

# RandomOverSampler + TomekLinks
X_resampled_ros, y_resampled_ros, title_ros = apply_resampling(X, y, RandomOverSampler(random_state=42), "RandomOverSampler + TomekLinks")
X_datasets.append(X_resampled_ros)
y_datasets.append(y_resampled_ros)
titles.append(title_ros)

# SMOTE + TomekLinks
X_resampled_smote, y_resampled_smote, title_smote = apply_resampling(X, y, SMOTE(random_state=42), "SMOTE + TomekLinks")
X_datasets.append(X_resampled_smote)
y_datasets.append(y_resampled_smote)
titles.append(title_smote)

# Plot the data
plot_data(X_datasets, y_datasets, titles)

# Check the class distribution after each resampling technique
print("Class Distribution:")
print(f"Original Data: {Counter(y)}")
print(f"After RandomUnderSampler + TomekLinks: {Counter(y_resampled_rus)}")
print(f"After RandomOverSampler + TomekLinks: {Counter(y_resampled_ros)}")
print(f"After SMOTE + TomekLinks: {Counter(y_resampled_smote)}")

Output Comparison:
