The world has changed, and so have our ways of handling money. With modern technology, we can withdraw and transfer money efficiently, flexibly, and easily. However, this convenience also brings challenges. Several reports indicate an elevated risk of fraud: as transactions become more flexible, they also become more attractive targets for cybercriminals.
Fraud attacks increasingly target credit card holders. Criminals employ a variety of tactics to deceive unsuspecting cardholders and siphon off funds. The rise in such schemes underscores the need for improved security measures and greater consumer vigilance.
To this end, preventive measures are crucial for keeping credit card users safe. One such measure is to improve transaction screening so we can determine whether a transaction is fraudulent. One way to do this is with a logistic regression model from machine learning.
Given a set of features, a logistic regression model can predict whether a transaction is fraudulent. The model analyzes the features and calculates a probability, and based on that probability it decides whether the transaction is a problem. This article walks through building such a model step by step so it can help a company detect fraud, covering:
- Describing the Number of Fraudulent Transactions in This Dataset
- Explaining How to Build the Logistic Regression Machine Learning Model
- Assessing the Model's Effectiveness in Detecting Fraudulent Transactions
We use the Python programming language to analyze this dataset. We rely on libraries such as NumPy and pandas for manipulating the dataframe, and seaborn and matplotlib for plotting. In addition, we use scikit-learn's logistic regression to build the fraud detection model. The dataset is available at this link so you can follow along.
1. Exploratory Data Analysis
First, import the essential libraries we will use:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
Then we read the dataset with the pandas function .read_csv() and inspect its structure:
df = pd.read_csv("creditcard_2023.csv")
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568630 entries, 0 to 568629
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   id      568630 non-null  int64
 1   V1      568630 non-null  float64
 2   V2      568630 non-null  float64
 3   V3      568630 non-null  float64
 4   V4      568630 non-null  float64
 5   V5      568630 non-null  float64
 6   V6      568630 non-null  float64
 7   V7      568630 non-null  float64
 8   V8      568630 non-null  float64
 9   V9      568630 non-null  float64
 10  V10     568630 non-null  float64
 11  V11     568630 non-null  float64
 12  V12     568630 non-null  float64
 13  V13     568630 non-null  float64
 14  V14     568630 non-null  float64
 15  V15     568630 non-null  float64
 16  V16     568630 non-null  float64
 17  V17     568630 non-null  float64
 18  V18     568630 non-null  float64
 19  V19     568630 non-null  float64
 20  V20     568630 non-null  float64
 21  V21     568630 non-null  float64
 22  V22     568630 non-null  float64
 23  V23     568630 non-null  float64
 24  V24     568630 non-null  float64
 25  V25     568630 non-null  float64
 26  V26     568630 non-null  float64
 27  V27     568630 non-null  float64
 28  V28     568630 non-null  float64
 29  Amount  568630 non-null  float64
 30  Class   568630 non-null  int64
dtypes: float64(29), int64(2)
memory usage: 134.5 MB
To get good results from our analysis, we need to understand the data, so let's review the documentation for each column:
- id: Unique identifier for each transaction
- V1-V28: Anonymized features representing various transaction attributes (e.g., time, location, and so on)
- Amount: The transaction amount
- Class: Binary label indicating whether the transaction is fraudulent (1) or not (0)
To gain more insight into the data, we can use the describe() function to view summary statistics for the columns of interest, Amount and Class:
df[["Amount","Class"]].describe()
We can also visualize the Amount column with a boxplot.
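The boxplot figure itself is not reproduced here; a minimal sketch of how it could be drawn with seaborn, assuming we compare Amount across the two Class values, is:
# Sketch: boxplot comparing transaction Amount between classes (0 = legitimate, 1 = fraud)
sns.boxplot(data=df, x="Class", y="Amount")
plt.title("Transaction Amount by Class")
plt.show()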
The data contains 568,630 transactions, with the smallest amount at $50 and the largest at $24,039. The average transaction amount is $12,041, with a standard deviation of $6,919. Fraudulent and non-fraudulent transactions show a similar distribution in this plot, so we also examine the exact class counts:
df["Class"].value_counts()
Class
0 284315
1 284315
Name: count, dtype: int64
2. Building the Logistic Regression Machine Learning Model
Before building our model, we separate the features and the target variable:
X = df.drop(["Class", "id"], axis=1)
y = df["Class"]
We drop the Class and id columns from X because Class is the target variable and id is merely an identifier, not a meaningful continuous or nominal feature. After that, we split X and y into train and test sets, then apply a preprocessing step to ensure the features have similar scales:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
We use 20% of the data as the test set and 80% as the training set. Next, we import the class for creating the model:
from sklearn.linear_model import LogisticRegressionCV

log_model = LogisticRegressionCV()
log_model.fit(X_train, y_train)
y_pred = log_model.predict(X_test)
Then we inspect the coefficients of our model:
coefs = pd.DataFrame(index=X.columns, data=log_model.coef_[0], columns=["coefficient"])
coefs["Odd"] = np.exp(coefs["coefficient"])
coefs
We can use NumPy's np.exp() function to calculate the odds ratio for each coefficient in the logistic regression model. The odds ratio gives an idea of how a one-unit change in an independent variable affects the odds of the outcome of interest (usually the positive class in binary classification). For example, the odds ratio for the Amount feature is 0.99658, so each unit increase in Amount reduces the odds of the event (e.g., fraud) by about 0.34% (1 - 0.99658 = 0.00342, or 0.342%).
In practical terms, this means transactions with larger amounts tend to have a slightly lower risk of fraud than transactions with smaller amounts, although the effect is very small.
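As a quick arithmetic check of that interpretation (the 0.99658 value is read from the coefficients table above):
# Check of the odds-ratio interpretation for the Amount feature
odds_ratio = 0.99658                    # taken from the coefficients table above
pct_change = (1 - odds_ratio) * 100     # ~0.342% decrease in the odds per one-unit increase
print(f"A one-unit increase in Amount multiplies the odds of fraud by {odds_ratio}, "
      f"i.e. lowers them by about {pct_change:.3f}%")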
3. Evaluating the Model's Metrics
To examine the model's metrics, we import the relevant functions from scikit-learn:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_curve
As a first metric, we check the accuracy of our model:
log_model = LogisticRegressionCV()
log_model.fit(X_train, y_train)
y_pred = log_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Accuracy: 96.51%
Our model correctly predicts 96.51% of cases, i.e., whether a transaction falls in class 0 or class 1. Even though the accuracy is good, we should consider other metrics to evaluate our model. For an overview of these metrics we can use classification_report from scikit-learn:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.95      0.98      0.97     56750
           1       0.98      0.95      0.96     56976

    accuracy                           0.97    113726
   macro avg       0.97      0.97      0.97    113726
weighted avg       0.97      0.97      0.97    113726
Let's break down from scratch how the precision, recall, and F1 score are obtained. Before that, it is helpful to plot the confusion matrix so we can plug its numbers into the formulas:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
To calculate accuracy, we sum the true positives (54,260) and true negatives (55,502), then divide by the total of true positives (54,260), true negatives (55,502), false positives (1,248), and false negatives (2,716). This gives 109,762 divided by 113,726, which is 0.965, rounding to roughly 0.97 in line with the classification report.
Next, we focus on class 1 to determine its precision, recall, and F1 score. For precision, the true positives for class 1 (54,260) are divided by the sum of true positives (54,260) and false positives (1,248), giving 54,260/55,508, or roughly 0.977, which rounds to 0.98. For recall, the true positives are divided by the sum of true positives and false negatives (2,716), giving 54,260/56,976, or roughly 0.952, which rounds to 0.95. The F1 score is twice the product of precision and recall divided by their sum: 2 × 0.98 × 0.95 / (0.98 + 0.95) ≈ 0.96.
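The same arithmetic can be reproduced directly in Python from the confusion-matrix counts (the numbers below are the ones read from the plot above):
# Recompute accuracy, precision, recall, and F1 for class 1 from the confusion-matrix counts
tn, fp, fn, tp = 55502, 1248, 2716, 54260   # counts from the confusion matrix

accuracy  = (tp + tn) / (tp + tn + fp + fn)                # ~0.965
precision = tp / (tp + fp)                                 # ~0.977
recall    = tp / (tp + fn)                                 # ~0.952
f1        = 2 * precision * recall / (precision + recall)  # ~0.964
print(accuracy, precision, recall, f1)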
We can also plot the false positive rate against the true positive rate (the ROC curve) with the following code:
from sklearn.metrics import roc_curve, auc

fpr, tpr, threshold = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
In the plot, the y-axis indicates the true positive rate and the x-axis indicates the false positive rate of our model. We can check the exact AUC value with the following code:
roc_auc_score(y_test, y_pred)
0.9651698082630356
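Note that because y_pred contains hard 0/1 labels, the ROC curve above has only a few points. If a smoother curve or a probability-based AUC is wanted, the model's predicted probabilities could be used instead (a small sketch, not part of the original analysis):
# Sketch: ROC AUC from predicted probabilities rather than hard class labels
y_proba = log_model.predict_proba(X_test)[:, 1]   # probability of class 1 (fraud)
print(roc_auc_score(y_test, y_proba))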
Next, we plot the precision-recall curve. It is useful because it evaluates the performance of binary classification models, especially when the class distribution is very imbalanced or when the cost of misprediction (e.g., false positives) matters. Here is the code:
precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
The exact values behind the visualization are:
precision_recall_curve(y_test, y_pred)
(array([0.50099362, 0.97751675, 1. ]),
array([1. , 0.95233081, 0. ]),
array([0, 1], dtype=int64))
This plot is really important for managing credit card fraud, helping stakeholders decide whether to focus more on recall or precision in their fraud detection systems. At the operating point where recall is 1 (every transaction is flagged), precision drops to about 0.50, which means roughly half of the flagged transactions are actually legitimate. That makes customers unhappy because of too many false alerts. At the model's default operating point, precision reaches about 0.977, meaning almost every flagged transaction is truly fraud and few legitimate transactions are wrongly marked, which reduces hassle for customers but may miss some fraud cases.
At that same operating point, recall is 0.952. This shows that most fraudulent transactions are caught, but a small fraction of fraud still goes unnoticed.
The choice of whether to emphasize precision or recall in credit card fraud detection should match the company's overall goals. If the company wants to minimize customer friction and keep transactions smooth, higher precision is preferable. If the priority is tight security and catching as much fraud as possible, higher recall is needed. Ultimately this depends on what management decides, since they must weigh customer experience against security needs.
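If a company wants to act on this trade-off, one practical option is to work with the model's predicted probabilities and choose a custom decision threshold instead of the default 0.5. The sketch below assumes an illustrative threshold of 0.7, not a tuned value:
# Sketch: probability-based precision-recall curve and a custom decision threshold
y_scores = log_model.predict_proba(X_test)[:, 1]             # probability of class 1 (fraud)
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

custom_threshold = 0.7                                        # illustrative, not tuned
y_pred_custom = (y_scores >= custom_threshold).astype(int)    # stricter flagging -> higher precision
print(classification_report(y_test, y_pred_custom))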
This article discussed analyzing credit card fraud and building a model that can effectively detect it. The dataset used contains 568,630 entries, featuring various attributes and a target variable that indicates whether a transaction is fraudulent. The data has a balanced distribution of fraudulent and non-fraudulent transactions.
We evaluated our model using metrics such as accuracy, precision, and recall. The model achieved an accuracy of 0.97, meaning it correctly classifies about 97% of the transactions in the test set. We also visualized precision and recall to help guide policy decisions about where to focus. The goal is to find the optimal balance between precision and recall in detecting credit card fraud, which will depend on the strategic aims of the company.