In the digital age, financial transactions happen at lightning pace, and detecting anomalies in these transactions is essential for stopping fraud and ensuring safety. This article delves into the world of anomaly detection, showcasing how we can leverage machine learning models like Isolation Forest and Autoencoders to uncover hidden anomalies in transaction data.
Anomaly detection is the process of identifying unusual patterns that do not conform to expected behavior. In the context of financial transactions, anomalies may indicate fraudulent activity, errors, or other irregularities that need attention.
Dataset: Github Link
For this project, we used a dataset containing transaction details, including transaction amounts, account types, and other relevant features. The first step was to load and explore the dataset to understand its structure and contents.
# Importing Libraries
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import IsolationForest
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, precision_recall_curve
# Load the dataset
data = pd.read_csv("transaction_anomalies_dataset.csv")
data.head()
Checking for null values in the dataset:
data.isnull().sum()
Now, let's look at the column insights:
data.info()
Exploratory data analysis (EDA) is vital to understand the distributions and relationships between variables. Visualizations helped in identifying patterns and potential anomalies.
# Distribution of Transaction Amount (Histogram)
dist_transaction = px.histogram(data, x = 'Transaction_Amount', nbins = 20, title = 'Distribution of Transaction Amount')
dist_transaction.update_layout(width = 1000, height = 600, xaxis_title = 'Transaction Amount', yaxis_title = 'Frequency')
dist_transaction.show()
Let's take a look at the distribution of transaction amount by account type.
# Transaction Amount by Account Type (Box plot)
transaction_acc_type = px.box(data, x = 'Account_Type', y = 'Transaction_Amount', title = 'Transaction Amount by Account Type')
transaction_acc_type.update_layout(width = 1000, height = 600, xaxis_title = 'Account Type', yaxis_title = 'Transaction Amount')
transaction_acc_type.show()
Next, let's visualize Average Transaction Amount vs. Age and look for a trend line.
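The original snippet for this plot isn't included above; here's a minimal sketch, assuming the dataset has an 'Age' column and that statsmodels is installed for the OLS trend line:
# Sketch: Average Transaction Amount vs. Age with an OLS trend line
# (assumes an 'Age' column exists; trendline='ols' requires statsmodels)
age_vs_avg = px.scatter(data, x = 'Age', y = 'Average_Transaction_Amount', trendline = 'ols',
                        title = 'Average Transaction Amount vs. Age')
age_vs_avg.update_layout(width = 1000, height = 600, xaxis_title = 'Age', yaxis_title = 'Average Transaction Amount')
age_vs_avg.show()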
The following code creates a bar chart using Plotly Express to visualize transaction frequencies by day of the week.
# Count of Transactions by Day of the Week (Bar chart)
day_of_week = px.bar(data, x = 'Day_of_Week', title = 'Frequency of Transactions by Day of the Week')
day_of_week.update_layout(width = 1000, height = 600, yaxis_title = 'Frequency', xaxis_title = 'Day of Week')
day_of_week.show()
A correlation heatmap helps identify relationships between variables, which is crucial for understanding data dynamics in anomaly detection and beyond.
# Heatmap of feature correlations (numeric columns only)
numeric_data = data.select_dtypes(include=['number'])
correlation_matrix = numeric_data.corr()
fig_corr_heatmap = go.Figure(data = go.Heatmap(z = correlation_matrix.values, x = correlation_matrix.columns, y = correlation_matrix.index))
fig_corr_heatmap.update_layout(title = 'Correlation Heatmap', height = 600)
fig_corr_heatmap.show()
A simple yet effective method is using the Z-score to detect anomalies. We calculated the mean and standard deviation of transaction amounts to identify outliers.
# Calculate mean and standard deviation of Transaction Amount
mean_amount = data['Transaction_Amount'].mean()
std_amount = data['Transaction_Amount'].std()
# Define the anomaly threshold (2 standard deviations above the mean)
anomaly_threshold = mean_amount + 2 * std_amount
# Flag anomalies
data['Is_Anomaly'] = data['Transaction_Amount'] > anomaly_threshold
# Scatter plot of Transaction Amount with anomalies highlighted
anomalies = px.scatter(data, x = 'Transaction_Amount', y = 'Average_Transaction_Amount', color = 'Is_Anomaly',
                       title = 'Anomalies in Transaction Amount')
anomalies.update_traces(marker = dict(size = 12), selector = dict(mode = 'markers', marker_size = 1))
anomalies.update_layout(height = 600, xaxis_title = 'Transaction Amount', yaxis_title = 'Average Transaction Amount')
anomalies.show()
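Equivalently, the same rule can be written with explicit z-scores; a short sketch (the 'Z_Score' column is added here purely for illustration):
# Sketch: explicit z-scores — flag values more than 2 standard deviations above the mean
data['Z_Score'] = (data['Transaction_Amount'] - mean_amount) / std_amount
print(data.loc[data['Z_Score'] > 2, ['Transaction_Amount', 'Z_Score']].head())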
Isolation Forest
Isolation Forest is an unsupervised learning algorithm designed for anomaly detection. It works by isolating observations in a tree structure, where anomalies require fewer splits to isolate.
# Calculate the number of anomalies
num_anomalies = data['Is_Anomaly'].sum()
# Calculate the total number of instances in the dataset
total_instances = data.shape[0]
# Calculate the ratio of anomalies
anomaly_ratio = num_anomalies / total_instances
print(anomaly_ratio)
Training the Isolation Forest model:
# Isolation Forest
relevant_features = ['Transaction_Amount', 'Average_Transaction_Amount', 'Frequency_of_Transactions']
# Split data into features (X) and target variable (y)
X = data[relevant_features]
y = data['Is_Anomaly']
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
# Train the Isolation Forest model
model = IsolationForest(contamination = 0.02, random_state = 42)
model.fit(X_train)
# Predict anomalies on the test set
y_pred = model.predict(X_test)
# Convert predictions to binary values (0: normal, 1: anomaly)
y_pred_binary = [1 if pred == -1 else 0 for pred in y_pred]
# Evaluate the model's performance
report = classification_report(y_test, y_pred_binary, target_names=['Normal', 'Anomaly'])
print(report)
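To see the isolation idea in practice, we can also inspect the raw anomaly scores (a quick sketch, not part of the original walkthrough): more negative scores mean a point was isolated in fewer splits and is therefore more anomalous.
# Sketch: inspect anomaly scores from the trained Isolation Forest
# (scores below 0 correspond to the points predict() labels as -1, i.e. anomalies)
scores = model.decision_function(X_test)
print(pd.Series(scores, index=X_test.index).describe())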
Finally, here's how we can use our trained model to detect anomalies for new inputs:
# Relevant features used during training
relevant_features = ['Transaction_Amount', 'Average_Transaction_Amount', 'Frequency_of_Transactions']
# Get user inputs for each feature
user_inputs = []
for feature in relevant_features:
    user_input = float(input(f"Enter the value for '{feature}': "))
    user_inputs.append(user_input)
# Create a DataFrame from the user inputs
user_df = pd.DataFrame([user_inputs], columns=relevant_features)
# Predict anomalies using the model
user_anomaly_pred = model.predict(user_df)
# Convert the prediction to a binary value (0: normal, 1: anomaly)
user_anomaly_pred_binary = 1 if user_anomaly_pred[0] == -1 else 0
if user_anomaly_pred_binary == 1:
    print("Anomaly detected: This transaction is flagged as an anomaly.")
else:
    print("No anomaly detected: This transaction is normal.")
Random Forest Classifier
Another machine learning model we explored was the Random Forest Classifier. This model is particularly well-suited for classification tasks and provides valuable insights through its feature importance metrics.
# Assuming 'X' is your feature matrix and 'y' is your target
model = RandomForestClassifier()
model.fit(X_train, y_train)
feature_importance = pd.DataFrame({'Feature': X.columns,
                                   'Importance': model.feature_importances_}).sort_values(by = 'Importance', ascending=False)
fig = px.bar(feature_importance, x = 'Importance', y = 'Feature', title = 'Feature Importance')
fig.show()
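The original code stops at the feature importance plot; as a quick sanity check, the classifier could also be scored on the same held-out split (a sketch):
# Sketch: evaluate the Random Forest on the test set from the earlier split
rf_pred = model.predict(X_test)
print(classification_report(y_test, rf_pred, target_names=['Normal', 'Anomaly']))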
Autoencoder
An Autoencoder is a type of neural network that learns a compressed representation of the data. It reconstructs the input data, and significant reconstruction errors indicate anomalies.
# Scale the numeric columns before training the autoencoder
numeric_data = data.select_dtypes(include=['number'])
scaler = StandardScaler()
numeric_data_scaled = scaler.fit_transform(numeric_data)
# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(numeric_data_scaled, test_size=0.2, random_state=42)
# Build the autoencoder model
input_dim = X_train.shape[1]
input_layer = Input(shape=(input_dim,))
encoder = Dense(14, activation = "relu")(input_layer)
encoder = Dense(7, activation = "relu")(encoder)
decoder = Dense(14, activation = "relu")(encoder)
decoder = Dense(input_dim, activation = "sigmoid")(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer = 'adam', loss = 'mean_squared_error')
# Train the autoencoder
history = autoencoder.fit(X_train, X_train, epochs = 50, batch_size = 32, validation_data = (X_test, X_test), verbose = 1)
Anomaly Detection
# Plot training and validation loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.show()
# Predict the reconstruction on the test data
X_test_pred = autoencoder.predict(X_test)
mse = np.mean(np.power(X_test - X_test_pred, 2), axis = 1)
# Calculate the reconstruction error threshold
threshold = np.percentile(mse, 95)
# Identify anomalies
anomalies = mse > threshold
# Plot the reconstruction error
plt.hist(mse, bins = 50)
plt.axvline(threshold, color = 'r', linestyle = '--')
plt.xlabel('Reconstruction Error')
plt.ylabel('Number of Samples')
plt.show()
# Print number of anomalies detected
print(f'Number of anomalies detected: {np.sum(anomalies)}')
Number of anomalies detected: 10
Evaluating the performance of anomaly detection models involves metrics like precision, recall, F1-score, and ROC-AUC. These metrics help in understanding how well the model distinguishes between normal and anomalous transactions.
# Isolation Forest evaluation
roc_score_if = roc_auc_score(y_test, y_pred_binary)
precision_if, recall_if, _ = precision_recall_curve(y_test, y_pred_binary)
print(f"Isolation Forest ROC-AUC: {roc_score_if}")
# Autoencoder evaluation (requires reconstruction error calculation)
reconstruction_error = autoencoder.predict(X_test)
mse = np.mean(np.power(X_test - reconstruction_error, 2), axis=1)
threshold = np.percentile(mse, 95)
y_pred_autoencoder = mse > threshold
roc_score_ae = roc_auc_score(y_test, y_pred_autoencoder)
precision_ae, recall_ae, _ = precision_recall_curve(y_test, y_pred_autoencoder)
print(f"Autoencoder ROC-AUC: {roc_score_ae}")
# Output (Evaluation of ML Models)
Isolation Forest ROC-AUC: 1.0
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 461us/step
Autoencoder ROC-AUC: 0.9846938775510203
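The precision-recall values computed above aren't plotted in the original code; here is a minimal sketch to compare the two curves:
# Sketch: compare precision-recall curves of the two models
plt.plot(recall_if, precision_if, label='Isolation Forest')
plt.plot(recall_ae, precision_ae, label='Autoencoder')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curves')
plt.legend()
plt.show()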
In this project, we explored different methods for anomaly detection in transaction data. Both the Isolation Forest and Autoencoder models demonstrated their strengths in identifying anomalies. The Isolation Forest is particularly effective for its simplicity and interpretability, while the Autoencoder shines at capturing complex patterns and relationships in the data.
Choosing the right model depends on the specific requirements and constraints of your application. For a balance between simplicity and performance, the Isolation Forest is a great choice. If your data has complex patterns, consider using an Autoencoder.