The world has changed, and so have our ways of handling money. With modern technology, we can withdraw and transfer money efficiently, flexibly, and easily. However, this convenience also brings challenges. Several reports indicate an elevated risk of fraud: as transactions become more flexible, they also become more attractive targets for cybercriminals.
Fraud attacks increasingly target credit card holders. Criminals employ a variety of tactics to deceive unsuspecting cardholders and siphon off funds. The rise in such schemes underscores the need for improved security measures and greater consumer vigilance.
To this end, preventive measures are crucial for keeping credit card users safe. One such measure is to improve transaction screening so we can determine whether a transaction is fraudulent. One way to do this is with a logistic regression model from machine learning.
Given a set of features, a logistic regression model can predict whether a transaction is fraudulent. The model analyzes the features and calculates a probability, and based on that probability it decides whether the transaction is a problem. This article walks through building such a model step by step so it can help a company detect fraud, covering:
- Describing the Number of Fraudulent Transactions in This Dataset
- Explaining How to Build the Logistic Regression Machine Learning Model
- Assessing the Model's Effectiveness in Detecting Fraudulent Transactions
We use the Python programming language to analyze this dataset. We rely on libraries such as NumPy and pandas for manipulating the dataframe, and seaborn and matplotlib for plotting. In addition, we use scikit-learn's logistic regression to build the fraud detection model. The dataset is available at this link so you can follow along.
1. Exploratory Data Analysis
First, import the essential libraries we will use:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
Then we read the dataset with the pandas function .read_csv() and inspect its structure:
df = pd.read_csv("creditcard_2023.csv")
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568630 entries, 0 to 568629
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   id      568630 non-null  int64
 1   V1      568630 non-null  float64
 2   V2      568630 non-null  float64
 3   V3      568630 non-null  float64
 4   V4      568630 non-null  float64
 5   V5      568630 non-null  float64
 6   V6      568630 non-null  float64
 7   V7      568630 non-null  float64
 8   V8      568630 non-null  float64
 9   V9      568630 non-null  float64
 10  V10     568630 non-null  float64
 11  V11     568630 non-null  float64
 12  V12     568630 non-null  float64
 13  V13     568630 non-null  float64
 14  V14     568630 non-null  float64
 15  V15     568630 non-null  float64
 16  V16     568630 non-null  float64
 17  V17     568630 non-null  float64
 18  V18     568630 non-null  float64
 19  V19     568630 non-null  float64
 20  V20     568630 non-null  float64
 21  V21     568630 non-null  float64
 22  V22     568630 non-null  float64
 23  V23     568630 non-null  float64
 24  V24     568630 non-null  float64
 25  V25     568630 non-null  float64
 26  V26     568630 non-null  float64
 27  V27     568630 non-null  float64
 28  V28     568630 non-null  float64
 29  Amount  568630 non-null  float64
 30  Class   568630 non-null  int64
dtypes: float64(29), int64(2)
memory usage: 134.5 MB
To get good results from our analysis, we need to understand the data, so let's review the documentation for each column:
- id: Unique identifier for each transaction
- V1-V28: Anonymized features representing various transaction attributes (e.g., time, location, and so on)
- Amount: The transaction amount
- Class: Binary label indicating whether the transaction is fraudulent (1) or not (0)
To gain more insight into the data, we can use the describe() function to view summary statistics for the columns of interest, Amount and Class:
df[["Amount","Class"]].describe()
We can also visualize the Amount column with a boxplot.
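The boxplot figure itself is not reproduced here; a minimal sketch of how it could be drawn with seaborn, assuming we compare Amount across the two Class values, is:
# Sketch: boxplot comparing transaction Amount between classes (0 = legitimate, 1 = fraud)
sns.boxplot(data=df, x="Class", y="Amount")
plt.title("Transaction Amount by Class")
plt.show()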
The data contains 568,630 transactions, with the smallest amount at $50 and the largest at $24,039. The average transaction amount is $12,041, with a standard deviation of $6,919. Fraudulent and non-fraudulent transactions show a similar distribution in this plot, so we also examine the exact class counts:
df["Class"].value_counts()
Class
0 284315
1 284315
Name: count, dtype: int64
2. Building the Logistic Regression Machine Learning Model
Before building our model, we separate the features and the target variable:
X = df.drop(["Class", "id"], axis=1)
y = df["Class"]
We drop the Class and id columns from X because Class is the target variable and id is merely an identifier, not a meaningful continuous or nominal feature. After that, we split X and y into train and test sets, then apply a preprocessing step to ensure the features have similar scales:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
We use 20% of the data as the test set and 80% as the training set. Next, we import the class for creating the model:
from sklearn.linear_model import LogisticRegressionCV

log_model = LogisticRegressionCV()
log_model.fit(X_train, y_train)
y_pred = log_model.predict(X_test)
Then we inspect the coefficients of our model:
coefs = pd.DataFrame(index=X.columns, data=log_model.coef_[0], columns=["coefficient"])
coefs["Odd"] = np.exp(coefs["coefficient"])
coefs
We can use NumPy's np.exp() function to calculate the odds ratio for each coefficient in the logistic regression model. The odds ratio gives an idea of how a one-unit change in an independent variable affects the odds of the outcome of interest (usually the positive class in binary classification). For example, the odds ratio for the Amount feature is 0.99658, so each unit increase in Amount reduces the odds of the event (e.g., fraud) by about 0.34% (1 - 0.99658 = 0.00342, or 0.342%).
In practical terms, this means transactions with larger amounts tend to have a slightly lower risk of fraud than transactions with smaller amounts, although the effect is very small.
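As a quick arithmetic check of that interpretation (the 0.99658 value is read from the coefficients table above):
# Check of the odds-ratio interpretation for the Amount feature
odds_ratio = 0.99658                    # taken from the coefficients table above
pct_change = (1 - odds_ratio) * 100     # ~0.342% decrease in the odds per one-unit increase
print(f"A one-unit increase in Amount multiplies the odds of fraud by {odds_ratio}, "
      f"i.e. lowers them by about {pct_change:.3f}%")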
3. Evaluating the Model's Metrics
To examine the model's metrics, we import the relevant functions from scikit-learn:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import precision_recall_curve
As a first metric, we check the accuracy of our model:
log_model = LogisticRegressionCV()
log_model.fit(X_train, y_train)
y_pred = log_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Accuracy: 96.51%
Our model correctly predicts 96.51% of cases, i.e., whether a transaction falls in class 0 or class 1. Even though the accuracy is good, we should consider other metrics to evaluate our model. For an overview of these metrics we can use classification_report from scikit-learn:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.95      0.98      0.97     56750
           1       0.98      0.95      0.96     56976

    accuracy                           0.97    113726
   macro avg       0.97      0.97      0.97    113726
weighted avg       0.97      0.97      0.97    113726
Let's break down from scratch how the precision, recall, and F1 score are obtained. Before that, it is helpful to plot the confusion matrix so we can plug its numbers into the formulas:
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
To calculate accuracy, we sum the true positives (54,260) and true negatives (55,502), then divide by the total of true positives (54,260), true negatives (55,502), false positives (1,248), and false negatives (2,716). This gives 109,762 divided by 113,726, which is 0.965, rounding to roughly 0.97 in line with the classification report.
Next, we focus on class 1 to determine its precision, recall, and F1 score. For precision, the true positives for class 1 (54,260) are divided by the sum of true positives (54,260) and false positives (1,248), giving 54,260/55,508, or roughly 0.977, which rounds to 0.98. For recall, the true positives are divided by the sum of true positives and false negatives (2,716), giving 54,260/56,976, or roughly 0.952, which rounds to 0.95. The F1 score is twice the product of precision and recall divided by their sum: 2 × 0.98 × 0.95 / (0.98 + 0.95) ≈ 0.96.
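The same arithmetic can be reproduced directly in Python from the confusion-matrix counts (the numbers below are the ones read from the plot above):
# Recompute accuracy, precision, recall, and F1 for class 1 from the confusion-matrix counts
tn, fp, fn, tp = 55502, 1248, 2716, 54260   # counts from the confusion matrix

accuracy  = (tp + tn) / (tp + tn + fp + fn)                # ~0.965
precision = tp / (tp + fp)                                 # ~0.977
recall    = tp / (tp + fn)                                 # ~0.952
f1        = 2 * precision * recall / (precision + recall)  # ~0.964
print(accuracy, precision, recall, f1)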
We can also plot the false positive rate against the true positive rate (the ROC curve) with the following code:
from sklearn.metrics import roc_curve, auc

fpr, tpr, threshold = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
In the plot, the y-axis indicates the true positive rate and the x-axis indicates the false positive rate of our model. We can check the exact AUC value with the following code:
roc_auc_score(y_test, y_pred)
0.9651698082630356
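Note that because y_pred contains hard 0/1 labels, the ROC curve above has only a few points. If a smoother curve or a probability-based AUC is wanted, the model's predicted probabilities could be used instead (a small sketch, not part of the original analysis):
# Sketch: ROC AUC from predicted probabilities rather than hard class labels
y_proba = log_model.predict_proba(X_test)[:, 1]   # probability of class 1 (fraud)
print(roc_auc_score(y_test, y_proba))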
Next, we plot the precision-recall curve. It is useful because it evaluates the performance of binary classification models, especially when the class distribution is very imbalanced or when the cost of misprediction (e.g., false positives) matters. Here is the code:
precision, recall, thresholds = precision_recall_curve(y_test, y_pred)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
The exact values behind the visualization are:
precision_recall_curve(y_test, y_pred)
(array([0.50099362, 0.97751675, 1. ]),
array([1. , 0.95233081, 0. ]),
array([0, 1], dtype=int64))
This plot is really important for managing credit card fraud, helping stakeholders decide whether to focus more on recall or precision in their fraud detection systems. At the operating point where recall is 1 (every transaction is flagged), precision drops to about 0.50, which means roughly half of the flagged transactions are actually legitimate. That makes customers unhappy because of too many false alerts. At the model's default operating point, precision reaches about 0.977, meaning almost every flagged transaction is truly fraud and few legitimate transactions are wrongly marked, which reduces hassle for customers but may miss some fraud cases.
At that same operating point, recall is 0.952. This shows that most fraudulent transactions are caught, but a small fraction of fraud still goes unnoticed.
The choice of whether to emphasize precision or recall in credit card fraud detection should match the company's overall goals. If the company wants to minimize customer friction and keep transactions smooth, higher precision is preferable. If the priority is tight security and catching as much fraud as possible, higher recall is needed. Ultimately this depends on what management decides, since they must weigh customer experience against security needs.
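If a company wants to act on this trade-off, one practical option is to work with the model's predicted probabilities and choose a custom decision threshold instead of the default 0.5. The sketch below assumes an illustrative threshold of 0.7, not a tuned value:
# Sketch: probability-based precision-recall curve and a custom decision threshold
y_scores = log_model.predict_proba(X_test)[:, 1]             # probability of class 1 (fraud)
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

custom_threshold = 0.7                                        # illustrative, not tuned
y_pred_custom = (y_scores >= custom_threshold).astype(int)    # stricter flagging -> higher precision
print(classification_report(y_test, y_pred_custom))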
This article discussed analyzing credit card fraud and building a model that can effectively detect it. The dataset used contains 568,630 entries, featuring various attributes and a target variable that indicates whether a transaction is fraudulent. The data has a balanced distribution of fraudulent and non-fraudulent transactions.
We evaluated our model using metrics such as accuracy, precision, and recall. The model achieved an accuracy of 0.97, meaning it correctly classifies about 97% of the transactions in the test set. We also visualized precision and recall to help guide policy decisions about where to focus. The goal is to find the optimal balance between precision and recall in detecting credit card fraud, which will depend on the strategic aims of the company.