Class distribution:
df['Class'].value_counts()
Class
0 86
1 84
Name: count, dtype: int64
The dataset appears fairly balanced, with 86 divorced and 84 still married.
Train-Test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
The dataset is split 80:20 into training and test sets respectively.
Several models were trained and evaluated to determine the best performing one. Model results show the accuracy, confusion matrix, and classification report for better understanding.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

estimators = {
    'Decision Tree': DecisionTreeClassifier(random_state=3),
    'Random Forest': RandomForestClassifier(random_state=3),
    'Extra Trees': ExtraTreesClassifier(random_state=3),
    'Gradient Boost': GradientBoostingClassifier(random_state=3),
    'AdaBoost': AdaBoostClassifier(random_state=3),
    'Logistic Regression': LogisticRegression(random_state=3),
    'SGDC': SGDClassifier(random_state=3),
    'Ridge': RidgeClassifier(random_state=3)
}
for name, model in estimators.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Print the model name
    print(f'{name}')
    # Print the accuracy score
    print(f' Accuracy: {accuracy_score(y_test, y_pred):.3f}')
    # Print the confusion matrix
    print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}')
    # Print the classification report
    print(f' Report: \n{classification_report(y_test, y_pred)}')
    print("*" * 100)
1. K-Nearest Neighbors (KNN):
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)
expected = y_test
# Use a list comprehension to find any wrong predictions
wrong_pred = [(p, e) for (p, e) in zip(predicted, expected) if p != e]
wrong_pred  # KNN wrongly predicted 1 out of our 34 values
[(0, 1)]
print(f'{knn.score(X_test, y_test):.2%}')
97.06%
KNN has a predictive accuracy of 97.06%, recording one misprediction out of 34 test values.
# K-FOLD CROSS VALIDATION
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, random_state=8, shuffle=True)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=knn, X=X_test,
                         y=y_test, cv=kfold)
scores: array([1. , 1. , 1. , 1. , 0.83333333])
scores.mean() = 96.67%
Cross-validation score: 96.67%
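The folds above run on the 34 held-out test samples only. As a minimal sketch (assuming X and y still hold the full 170 rows loaded earlier), the same check can be run over the entire dataset for a steadier estimate:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# 5-fold cross-validation of KNN over the entire dataset rather than the test split alone
kfold_full = KFold(n_splits=5, random_state=8, shuffle=True)
full_scores = cross_val_score(estimator=KNeighborsClassifier(), X=X, y=y, cv=kfold_full)
print(f'Fold scores: {full_scores}')
print(f'Mean accuracy: {full_scores.mean():.2%}')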
2. Decision Tree:
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(random_state=3)
decision_tree.fit(X_train, y_train)
Decision Tree
Accuracy: 1.000
Confusion Matrix:
[[16 0]
[ 0 18]]
Report:
precision recall f1-score help
0 1.00 1.00 1.00 16
1 1.00 1.00 1.00 18
accuracy 1.00 34
macro avg 1.00 1.00 1.00 34
weighted avg 1.00 1.00 1.00 34
Accuracy: 100%
The Decision Tree model perfectly classified all instances in the test set, indicating potential overfitting.
3. Random Forest:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(random_state=3)
random_forest.fit(X_train, y_train)
Random Forest
Accuracy: 0.971
Confusion Matrix:
[[16 0]
[ 1 17]]
Report:
precision recall f1-score help
0 0.94 1.00 0.97 16
1 1.00 0.94 0.97 18
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
Accuracy: 97.1%
Random Forest achieved high accuracy, similar to KNN, but with slightly more misclassifications compared to the Decision Tree.
4. Gradient Boost:
from sklearn.ensemble import GradientBoostingClassifier
gradient_boost = GradientBoostingClassifier(random_state=3)
gradient_boost.fit(X_train, y_train)
Gradient Boost
Accuracy: 1.000
Confusion Matrix:
[[16 0]
[ 0 18]]
Report:
precision recall f1-score help
0 1.00 1.00 1.00 16
1 1.00 1.00 1.00 18
accuracy 1.00 34
macro avg 1.00 1.00 1.00 34
weighted avg 1.00 1.00 1.00 34
Accuracy: 100%
Like the Decision Tree, Gradient Boosting also achieved perfect accuracy on the test set.
5. Logistic Regression:
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression(random_state=3)
logistic_regression.fit(X_train, y_train)
Logistic Regression
Accuracy: 0.971
Confusion Matrix:
[[16 0]
[ 1 17]]
Report:
precision recall f1-score help
0 0.94 1.00 0.97 16
1 1.00 0.94 0.97 18
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
Accuracy: 97.1%
Logistic Regression also performed well, demonstrating its effectiveness for this classification problem.
6. SGD Classifier and Ridge Classifier:
Both models achieved perfect accuracy, similar to the Decision Tree and Gradient Boosting models.
Overfitting might be an issue, as many of the models show 100% accuracy when predicting.
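One rough way to probe this before resampling or regularizing (a sketch, assuming the estimators dictionary and the full X and y from earlier) is to compare each model's cross-validated accuracy with its single-split test accuracy:
from sklearn.model_selection import cross_val_score
# Mean 5-fold cross-validated accuracy for every model; a large gap versus the
# single 80:20 test score above would point towards overfitting to that split.
for name, model in estimators.items():
    cv_scores = cross_val_score(model, X, y, cv=5)
    print(f'{name}: mean CV accuracy = {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})')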
We use SMOTE (Synthetic Minority Over-sampling Technique) to create synthetic samples for better class balance. As highlighted earlier, our data looks well balanced, but let's see if the models can be improved.
Regularization will help the models generalize better.
1. SMOTE:
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Load the data
df = pd.read_csv(r'C:\Users\ADMin\data analytics\analyst HQ\divorce.csv', delimiter=';')
# Split data into features and target
X = df.drop('Class', axis=1)
y = df['Class']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3, stratify=y)
# Apply SMOTE to generate synthetic samples
smote = SMOTE(random_state=3)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Check the class distribution after applying SMOTE
print(f'Class distribution after SMOTE: \n{pd.Series(y_train_smote).value_counts()}')
Class distribution after SMOTE:
Class
1 69
0 69
Name: count, dtype: int64
Perfectly balanced.
2. LASSO AND RIDGE LOGISTIC REGRESSION:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# L1 Regularization (Lasso)
logreg_l1 = LogisticRegression(penalty='l1', solver='liblinear', random_state=3)
logreg_l1.fit(X_train_smote, y_train_smote)
y_pred_l1 = logreg_l1.predict(X_test)
print("L1 Regularization (Lasso) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l1):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l1)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l1)}')
# L2 Regularization (Ridge)
logreg_l2 = LogisticRegression(penalty='l2', solver='liblinear', random_state=3)
logreg_l2.fit(X_train_smote, y_train_smote)
y_pred_l2 = logreg_l2.predict(X_test)
print("L2 Regularization (Ridge) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l2):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l2)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l2)}')
L1 Regularization (Lasso) Results:
Accuracy: 0.971
Confusion Matrix:
[[17 0]
[ 1 16]]
Classification Report:
precision recall f1-score help
0 0.94 1.00 0.97 17
1 1.00 0.94 0.97 17
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
L2 Regularization (Ridge) Results:
Accuracy: 0.971
Confusion Matrix:
[[17 0]
[ 1 16]]
Classification Report:
precision recall f1-score help
0 0.94 1.00 0.97 17
1 1.00 0.94 0.97 17
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
3. LASSO AND RIDGE SGDClassifier:
from sklearn.linear_model import SGDClassifier
SGD_l1 = SGDClassifier(penalty='l1', random_state=3)
SGD_l1.fit(X_train_smote, y_train_smote)
y_pred_l1 = SGD_l1.predict(X_test)
print("L1 Regularization (Lasso) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l1):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l1)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l1)}')
# L2 Regularization (Ridge)
SGD_l2 = SGDClassifier(penalty='l2', random_state=3)
SGD_l2.fit(X_train_smote, y_train_smote)
y_pred_l2 = SGD_l2.predict(X_test)
print("L2 Regularization (Ridge) Results:")
print(f' Accuracy: {accuracy_score(y_test, y_pred_l2):.3f}')
print(f' Confusion Matrix: \n{confusion_matrix(y_test, y_pred_l2)}')
print(f' Classification Report: \n{classification_report(y_test, y_pred_l2)}')
L1 Regularization (Lasso) Results:
Accuracy: 0.971
Confusion Matrix:
[[16 1]
[ 0 17]]
Classification Report:
precision recall f1-score help
0 1.00 0.94 0.97 17
1 0.94 1.00 0.97 17
accuracy 0.97 34
macro avg 0.97 0.97 0.97 34
weighted avg 0.97 0.97 0.97 34
L2 Regularization (Ridge) Results:
Accuracy: 1.000
Confusion Matrix:
[[17 0]
[ 0 17]]
Classification Report:
precision recall f1-score help
0 1.00 1.00 1.00 17
1 1.00 1.00 1.00 17
accuracy 1.00 34
macro avg 1.00 1.00 1.00 34
weighted avg 1.00 1.00 1.00 34
The lasso and ridge regularization techniques appear to have minimal impact. Notably, only lasso regularization on the SGDClassifier shows a slight effect, reducing overfitting and lowering the predictive rate from 100% to 97.1%. However, both regularization approaches for logistic regression yield the same predictive value of 97.1% as the non-regularized model, indicating no significant improvement.
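If the regularization strength itself is worth exploring, here is a minimal sketch (assuming the SMOTE-resampled training data and the test split defined above) of sweeping the inverse regularization strength C with GridSearchCV; the grid values are illustrative only:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Smaller C means stronger regularization
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(LogisticRegression(solver='liblinear', random_state=3),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_smote, y_train_smote)
print(f'Best parameters: {grid.best_params_}')
print(f'Best cross-validated accuracy: {grid.best_score_:.3f}')
print(f'Test accuracy: {grid.score(X_test, y_test):.3f}')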
- High-Accuracy Models: Decision Tree, Gradient Boosting, and SGD Classifier achieved perfect accuracy on the test set, prompting us to apply regularization, which had little effect. However, such high accuracy may indicate overfitting, necessitating further validation with a larger dataset.
- Strong Performance: KNN, Logistic Regression, and Random Forest provided consistently high accuracy with good generalization capabilities.
- Feature Importance: Investigating feature importance for models like Random Forest can provide insights into key factors influencing divorce, potentially guiding relationship counseling and interventions (see the sketch after this list).
- Future Work: To ensure robustness, further cross-validation, hyperparameter tuning, and testing on an expanded dataset are recommended.
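As a starting point for the feature-importance suggestion above, a minimal sketch (assuming the random_forest model fitted earlier and the feature DataFrame X):
import pandas as pd
# Rank the questionnaire features by the importance the fitted Random Forest assigns to them
importances = pd.Series(random_forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))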