Recently, I was asked to find a car insurance premium dataset and predict a “fair” premium for each policyholder. The task included exploratory data analysis, feature engineering, model selection and comparison, and ideas for model improvement. It was an interesting and insightful exercise, as our expectations often fail and we cannot tackle the problem in a routine, straightforward way. Here, I describe the overall process and the details. The longer version of the analyses can be found in this GitHub repository.
The dataset belongs to a French car insurer and comes in two parts, which can be downloaded at this and this links. One table contains details about policyholders, and the other table contains the claims and the claim amounts for each policyholder, if any. Although the task is to predict a “fair” premium for each policyholder, in essence, insurance companies aim not to lose money in the long run while also increasing the number of their customers.
Let’s have a look at the data. The two tables are in arff format, so I first load them and convert them to dataframe format.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import arff
data_freq = arff.load("data/freMTPL2freq.arff")
df_freq = pd.DataFrame(
data_freq,
columns=[
"IDpol",
"ClaimNb",
"Exposure",
"Area",
"VehPower",
"VehAge",
"DrivAge",
"BonusMalus",
"VehBrand",
"VehGas",
"Density",
"Region",
],
)
data_sev = arff.load("data/freMTPL2sev.arff")
df_sev = pd.DataFrame(data_sev, columns=["IDpol", "ClaimAmount"])
As mentioned above, the first table contains details about policyholders. In total, there are more than 678K records in this dataset.
print(df_freq.shape)
df_freq.head()
(678013, 12)
The number of claim records in the second table is 26,639.
print(df_sev.shape)
df_sev.head()
(26639, 2)
As some policyholders have had more than one claim, we have to aggregate the total claim amount for each policyholder. As shown below, the number of policyholders who have filed at least one claim is 24,950.
# sum of claims per policyholder
df_sev_agg = (
df_sev.groupby("IDpol")["ClaimAmount"]
.sum()
.to_frame()
.rename(columns={"ClaimAmount": "SumClaimAmount"})
.reset_index()
)
df_sev_agg.shape
(24950, 2)
Now we can merge the two tables as shown below. Comparing the size of df_sev_agg with the non-null values in the merged dataframe, df_merged, shows that for some claims we do not have details about their policyholders; their data is missing. The left join has already eliminated these records from our dataset.
df_merged = pd.merge(df_freq, df_sev_agg, on="IDpol", how="left")
df_merged["SumClaimAmount"].notna().value_counts()
SumClaimAmount
False 653069
True 24944
Name: count, dtype: int64
The dependent variable is the sum of the claim amounts divided by the Exposure, which is the duration of the insurance contract in years. We compute the target variable and add it to the dataset as shown below.
Out of 678,013 contracts, only 24,944 have had a claim, which is about 3.7%. This means that the dataset is highly imbalanced. To study the variables, we will focus on contracts with claim costs. When modeling, we can either analyze only contracts with claim costs or consider all contracts while being mindful of the imbalanced classes.
# Calculate the target variable
df_merged["target"] = df_merged["SumClaimAmount"] / df_merged["Exposure"]

# Define numerical and categorical columns
num_cols = [
"Exposure",
"SumClaimAmount",
"VehPower",
"VehAge",
"DrivAge",
"BonusMalus",
"Density",
"target",
]
cat_cols = ["Area", "VehBrand", "VehGas", "Region"]
# Create a combined DataFrame with numerical columns and categorical columns converted to category dtype
df_all = pd.concat(
    [df_merged[num_cols]] + [df_merged[c].astype("category") for c in cat_cols], axis=1
)
# Dataframe of contracts with a claim cost
df = df_all[df_all["SumClaimAmount"].notna()]
# Add a 'HasClaim' column to df_all indicating whether there was a claim
df_all["HasClaim"] = df_all["target"].notna()
# Fill missing 'target' values with 0.0
df_all["target"] = df_all["target"].fillna(0.0)
# df = df_all
First, we familiarize ourselves with the target variable.
fig, axs = plt.subplots(1, 3, figsize=(12, 4))
_ = axs[0].hist(df["target"], bins=100)
axs[0].set_title("Hist. of Target")
_ = axs[1].hist(df["target"].map(np.log10), bins=100)
axs[1].set_title("Hist. of log(Target)")
_ = axs[2].boxplot(df["target"].map(np.log10), labels=["log(Target)"])
_ = axs[2].set_title("Boxplot of Target")
The plots indicate that the dependent variable, target, covers a very wide range and is highly skewed. Furthermore, there are some data points with exceptionally high or low values, which could be considered outliers. However, we should be careful about labeling a data point as an “outlier”, since we are dealing with an insurance problem.
Can we examine where these outliers come from? Since target is calculated as SumClaimAmount divided by Exposure, it is worth analyzing each of these variables.
fig, axs = plt.subplots(1, 3, figsize=(12, 3))
_ = axs[0].hist(df["Exposure"], bins=50)
axs[0].set_title("Exposure")
_ = axs[1].hist(df["SumClaimAmount"], bins=50)
axs[1].set_title("SumClaimAmount")
_ = axs[2].hist(df["SumClaimAmount"].map(np.log10), bins=50)
_ = axs[2].set_title("log(SumClaimAmount)")
plt.tight_layout()
The plots reveal cases in the dataset where Exposure values are very low, resulting in disproportionately high values of the dependent variable. However, considering the low probability of filing a claim, defining the dependent variable as SumClaimAmount divided by Exposure as a measure of the expected claim cost for each policyholder appears to overestimate that cost. This, in turn, produces quite a few extreme values, making the distribution of the dependent variable highly skewed.
On the other hand, there are cases with a very high SumClaimAmount. Thus, removing cases that look like outliers requires careful consideration; a quick check of the records with very short exposures is sketched below.
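As a quick sanity check (the 0.1-year cutoff below is an arbitrary choice for illustration, not a threshold used in the original analysis), we can look at how many claim records have very short exposures and how extreme their targets are:
# How many contracts with claims have a very short exposure?
# The 0.1-year cutoff is an arbitrary, illustrative threshold.
short_exposure = df["Exposure"] < 0.1
print(f"Claim records with Exposure < 0.1: {short_exposure.sum()}")
print(df.loc[short_exposure, ["Exposure", "SumClaimAmount", "target"]].describe())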
As target has a huge range and is skewed, we may also consider analyzing and using its log-transformed values.
df["targetLog"] = df["target"].map(np.log10)
_ = sns.histplot(data=df, x="targetLog")
If we decide to remove outliers, we can use the IQR or the Z-score method. Here, the latter is applied to the log values of the response variable; an IQR-based alternative is sketched after the Z-score output below.
# Z-Score
z_scores = stats.zscore(df["targetLog"])
threshold = 3
outliers = (z_scores < -threshold) | (z_scores > threshold)
df["IsOutlier"] = outliers
print(f"Number of outliers: {outliers.value_counts()[True]}")
Number of outliers: 299
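For comparison, an IQR-based filter on the same log-transformed target could look like the minimal sketch below; the 1.5 multiplier is the usual rule of thumb, not a value tuned for this dataset.
# IQR-based outlier flag on the log target (a sketch; 1.5 is the common rule-of-thumb factor)
q1, q3 = df["targetLog"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers_iqr = (df["targetLog"] < lower) | (df["targetLog"] > upper)
print(f"Number of IQR outliers: {outliers_iqr.sum()}")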
Focusing on policyholders with claim costs, we analyze the numerical variables. In the Jupyter notebook on GitHub, you can see the plots for each variable. Here, I briefly summarize them:
- VehPower: vehicle power.
- VehAge: vehicle age ranges from 0 to 100. Only 20 cases have VehAge > 30. These cases could be removed during modeling and handled differently later if necessary.
- DrivAge: only 14 cases have DrivAge > 90. These cases could be removed during modeling and handled differently later if necessary.
- BonusMalus: only 40 cases have BonusMalus > 150. These cases could be removed during modeling and handled differently later if necessary.
- Density: as shown in the plots below, density has a very wide range, and its distribution seems to follow a power law. So we also convert it to a log scale to make it resemble a normal distribution.
fig, axs = plt.subplots(2, 2, figsize=(12, 8), width_ratios=(1, 1.5))
axs = axs.flatten()
_ = sns.histplot(data=df, x="Density", ax=axs[0])
bins = pd.cut(df["Density"], right=False, bins=range(0, 30000, 500))
df_tmp = df.groupby(bins, observed=True)["target"].mean().to_frame().reset_index()
_ = sns.barplot(df_tmp, x="Density", y="target", errorbar=None, ax=axs[1])
axs[1].set_ylabel("Mean Target")
axs[1].tick_params(axis="x", rotation=90)
df["DensityLog"] = df["Density"].map(np.log10)
df_all["DensityLog"] = df_all["Density"].map(np.log10)
_ = sns.histplot(df["DensityLog"], ax=axs[2])
bins = pd.cut(df["DensityLog"], right=False, bins=np.linspace(0, 5, 50))
df_tmp = df.groupby(bins, observed=True)["target"].mean().to_frame().reset_index()
_ = sns.barplot(df_tmp, x="DensityLog", y="target", errorbar=None, ax=axs[3])
axs[3].tick_params(axis="x", rotation=90)
plt.tight_layout()
As shown in the plot below, the numerical variables exhibit very low correlation with both target and targetLog. Therefore, these features may not be very effective in predicting the dependent variables.
plt.figure(figsize=(8, 8))
cols = [
"target",
"targetLog",
"SumClaimAmount",
"Exposure",
"VehPower",
"VehAge",
"DrivAge",
"BonusMalus",
"Density",
"DensityLog",
]
_ = sns.heatmap(df[cols].corr(), annot=True, linewidths=1, fmt=".3f")
You can see the plots for the categorical variables in the Jupyter notebook on GitHub. Here, we conduct statistical analyses of these variables. An ANOVA test shows that the distribution of the target variable is not significantly different across the categorical variables. However, some categorical variables exhibit a significant difference for targetLog. Consequently, modeling targetLog might yield better results.
def ANOVA_test(target_col):
for c in cat_cols:
df_tmp = [g[1] for g in df.groupby([c], observed=True)[target_col]]
_, pvalue = stats.f_oneway(*df_tmp)
print(f"Variable: {c}tp-value: {pvalue}")print("Dependent Variable: Purpose")
ANOVA_test("purpose")
print("nnDependent Variable: log(Purpose)")
ANOVA_test("targetLog")
Dependent Variable: Target
Variable: Area    p-value: 0.5621655789253363
Variable: VehBrand    p-value: 0.6401726096346133
Variable: VehGas    p-value: 0.22047970212461002
Variable: Region    p-value: 0.9735818647330804

Dependent Variable: log(Target)
Variable: Area    p-value: 2.864841485854489e-08
Variable: VehBrand    p-value: 3.6498732677036494e-60
Variable: VehGas    p-value: 0.8569818310183478
Variable: Region    p-value: 6.55718358261526e-36
The statistical analysis of the features revealed that we have no standout features for predicting the dependent variable. First, we try to predict the expected claim amount for a policyholder, given that he or she has a claim.
We try three models:
- SVR (Support Vector Regression)
- XGBoost
- Random Forest
import xgboost as xgb
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

def get_SVR_model():
preprocess = ColumnTransformer(
transformers=[
(
"Cat",
Pipeline(
[
("OneHot", preprocessing.OneHotEncoder()),
("Normalizer", preprocessing.StandardScaler(with_mean=False)),
]
),
cat_cols,
),
("Num-Normalizer", preprocessing.StandardScaler(), num_cols),
]
)
model = SVR()
pipeline = Pipeline([("preprocess", preprocess), ("svr_model", model)])
return pipeline
def get_XGB_model():
preprocess = ColumnTransformer(
[
("OrdinalEncoder", preprocessing.OrdinalEncoder(), cat_cols),
("Normalizer", preprocessing.StandardScaler(), num_cols),
]
)
model = xgb.XGBRegressor()
pipeline = Pipeline([("preprocess", preprocess), ("xgb_model", model)])
return pipeline
def get_RF_model():
preprocess = ColumnTransformer(
[
("OrdinalEncoder", preprocessing.OrdinalEncoder(), cat_cols),
("Normalizer", preprocessing.StandardScaler(), num_cols),
]
)
model = RandomForestRegressor()
pipeline = Pipeline([("preprocess", preprocess), ("rf_model", model)])
return pipeline
def get_test_train(df, cols, target):
X = df[cols]
y = df[target]
return train_test_split(X, y, test_size=0.20, random_state=37)
def train(model):
    model.fit(train_x, train_y)
    pred_y = model.predict(test_x)
    mse = mean_squared_error(test_y, pred_y)
    return mse
def filter_dataset(df):
df_tmp = df.copy()
df_tmp = df_tmp[df_tmp["VehAge"] <= 30]
df_tmp = df_tmp[df_tmp["DrivAge"] <= 90]
df_tmp = df_tmp[df_tmp["BonusMalus"] <= 150]
df_tmp = df_tmp[~df_tmp["IsOutlier"]]
return df_tmp
We use mean squared error to evaluate the results and compare them against a baseline, which is simply the mean of SumClaimAmount. We train and test models for both target and targetLog.
num_cols = ["VehPower", "VehAge", "DrivAge", "BonusMalus", "DensityLog"]
cat_cols = ["Area", "VehBrand", "VehGas", "Region"]
cols = cat_cols + num_cols

df_tmp = filter_dataset(df)
target = "target"
train_x, test_x, train_y, test_y = get_test_train(df_tmp, cols, target)
mse = mean_squared_error(test_y, [train_y.mean()] * len(test_y))
print(f"Baseline - Root mean squared error (RMSE): {np.sqrt(mse):.2f}")
rf_model = get_RF_model()
mse = train(rf_model)
print(f"\nRandomForest - Root mean squared error (RMSE): {np.sqrt(mse):.2f}")
xgb_model = get_XGB_model()
mse = train(xgb_model)
print(f"\nXGB - Root mean squared error (RMSE): {np.sqrt(mse):.2f}")
svr_model = get_SVR_model()
mse = train(svr_model)
print(f"\nSVR - Root mean squared error (RMSE): {np.sqrt(mse):.2f}")
Baseline - Root mean squared error (RMSE): 9983.06
RandomForest - Root mean squared error (RMSE): 10464.24
XGB - Root mean squared error (RMSE): 10568.72
SVR - Root mean squared error (RMSE): 10333.84
target = "targetLog"
train_x, test_x, train_y, test_y = get_test_train(df_tmp, cols, target)
mse = mean_squared_error(test_y, [train_y.mean()] * len(test_y))
print(f"Baseline - Root mean squared error (RMSE): {np.power(10, np.sqrt(mse)):.2f}")
rf_model = get_RF_model()
mse = train(rf_model)
print(
    f"\nRandomForest - Root mean squared error (RMSE): {np.power(10, np.sqrt(mse)):.2f}"
)
xgb_model = get_XGB_model()
mse = train(xgb_model)
print(f"\nXGB - Root mean squared error (RMSE): {np.power(10, np.sqrt(mse)):.2f}")
svr_model = get_SVR_model()
mse = train(svr_model)
print(f"\nSVR - Root mean squared error (RMSE): {np.power(10, np.sqrt(mse)):.2f}")
Baseline - Root mean squared error (RMSE): 3.69
RandomForest - Root mean squared error (RMSE): 3.73
XGB - Root mean squared error (RMSE): 3.74
SVR - Root mean squared error (RMSE): 3.64
The above models did not perform satisfactorily. In fact, they performed worse than the baseline, which was simply the mean of the target variable. However, these results are not surprising given the results of the statistical tests we saw earlier.
We could try hyperparameter optimization here. In this post and this GitHub repository, I have described how to optimize hyperparameters for XGBoost, and the approach can easily be generalized to other models as well. However, given the poor performance of the models with the default settings, I don't think hyperparameter optimization will change anything here; still, a minimal sketch of what such a search could look like is shown below.
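As an illustration only (the pipeline step name comes from get_XGB_model above, but the parameter ranges, number of iterations, and cross-validation settings are assumptions, not values from the original analysis), a randomized search could be wired up roughly like this:
from sklearn.model_selection import RandomizedSearchCV

# A minimal sketch of a hyperparameter search over the XGBoost pipeline;
# the grid, n_iter, and cv values are illustrative assumptions.
param_distributions = {
    "xgb_model__n_estimators": [100, 300, 500],
    "xgb_model__max_depth": [3, 5, 7],
    "xgb_model__learning_rate": [0.01, 0.05, 0.1],
    "xgb_model__subsample": [0.7, 0.9, 1.0],
}
search = RandomizedSearchCV(
    get_XGB_model(),
    param_distributions=param_distributions,
    n_iter=20,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=37,
)
# search.fit(train_x, train_y)
# print(search.best_params_, np.sqrt(-search.best_score_))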
We did not get satisfactory results with regression. Now, I am interested in investigating whether we can predict, from the features, whether a policyholder files a claim. Thus, we convert the problem into a binary classification problem.
We already know that in this dataset only 3.7% of policyholders have filed a claim, whereas 96.3% have not. Such a highly imbalanced dataset requires careful handling of the imbalanced classes. Detecting the policyholders who do file a claim, i.e., recall, is more important than the accuracy or precision of the model. By using class weights and weighting the positive class (the cases with HasClaim=True), we can guide the model toward the outcome we care about; a quick back-of-the-envelope check of the implied class weights is shown below.
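As a back-of-the-envelope check (this is just an illustration of what balanced weighting implies, not part of the model code), the positive class is so rare that a balanced weighting would upweight it relative to the negative class by a factor of roughly 26:
# Share of positive cases and the pos/neg weight ratio implied by "balanced" weighting.
pos_rate = df_all["HasClaim"].mean()
print(f"Positive rate: {pos_rate:.3f}")
print(f"Implied balanced pos/neg weight ratio: {(1 - pos_rate) / pos_rate:.1f}")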
For this classification task, I use logistic regression.
import torch
from torch.utils.data import DataLoader, TensorDataset

num_cols = ["VehPower", "VehAge", "DrivAge", "BonusMalus", "DensityLog"]
cat_cols = ["Area", "VehBrand", "VehGas", "Region"]
# Preprocess the data
preprocess = ColumnTransformer(
transformers=[
(
"cat",
Pipeline(
[
("OneHot", preprocessing.OneHotEncoder()),
(
"Normalizer",
preprocessing.StandardScaler(with_mean=False),
),
]
),
cat_cols,
),
("num", preprocessing.StandardScaler(), num_cols),
]
)
target = "HasClaim"
# Split the data into training and test sets
X = df_all[num_cols + cat_cols]
y = df_all[target]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=37
)
X_train = preprocess.fit_transform(X_train).toarray()
X_test = preprocess.transform(X_test).toarray()
X_train_torch = torch.tensor(X_train, dtype=torch.float32)
y_train_torch = torch.tensor(list(y_train), dtype=torch.float32)
X_test_torch = torch.tensor(X_test, dtype=torch.float32)
y_test_torch = torch.tensor(list(y_test), dtype=torch.float32)
# Create datasets
train_dataset = TensorDataset(X_train_torch, y_train_torch)
test_dataset = TensorDataset(X_test_torch, y_test_torch)
# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class LogisticRegression(nn.Module):
def __init__(self, input_dim):
super(LogisticRegression, self).__init__()
self.linear = nn.Linear(input_dim, 1)
def forward(self, x):
return torch.sigmoid(self.linear(x))
def train_model(num_epochs, positive_class_weight=None):
    if positive_class_weight is not None:
        class_weights = compute_class_weight(
            "balanced", classes=np.unique(y_train), y=y_train
        )
        class_weight_tensor = torch.tensor(
            [positive_class_weight * class_weights[1]], dtype=torch.float32
        )
        # Define the loss function with a positive-class weight
        criterion = nn.BCEWithLogitsLoss(pos_weight=class_weight_tensor)
    else:
        criterion = nn.BCEWithLogitsLoss()

    # Move the model and criterion to the GPU if available
    model = LogisticRegression(X_train.shape[1]).to(device)
    criterion = criterion.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    print("Training ...")
    for epoch in range(num_epochs):
        model.train()
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target.unsqueeze(1))
            loss.backward()
            optimizer.step()
        if (epoch + 1) % 5 == 0:
            print(f"Epoch {epoch+1}, Loss: {loss.item()}")
    return model
def evaluate(model):
    model.eval()
    y_pred = []
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            predicted = (outputs.data > 0.5).float()
            y_pred += predicted.tolist()
    print(classification_report(y_test, y_pred, zero_division=0))
We can try different values for the weight of the positive class. If we set it to the value below, we achieve a recall of 73%, at the expense of lower precision and accuracy. (A small sweep over several candidate weights is sketched after the report.)
model = train_model(50, positive_class_weight=4)
evaluate(model)
Training ...
Epoch 5, Loss: 1.093119740486145
Epoch 10, Loss: 1.0352026224136353
Epoch 15, Loss: 1.1928868293762207
Epoch 20, Loss: 2.6955227851867676
Epoch 25, Loss: 1.1254448890686035
Epoch 30, Loss: 0.9496679306030273
Epoch 35, Loss: 0.9839045405387878
Epoch 40, Loss: 1.033744215965271
Epoch 45, Loss: 0.9856802225112915
Epoch 50, Loss: 1.0736278295516968
              precision    recall  f1-score   support

       False       0.98      0.42      0.58    130699
        True       0.05      0.73      0.08      4904

    accuracy                           0.43    135603
   macro avg       0.51      0.57      0.33    135603
weighted avg       0.94      0.43      0.56    135603
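If we wanted to compare several weights side by side rather than a single value, a simple sweep like the sketch below could be used; the candidate weights and the reduced epoch count are arbitrary, illustrative choices.
# Sweep a few positive-class weights and print the classification report for each.
# The weight list and the 10-epoch budget are illustrative, not tuned values.
for w in [1, 2, 4, 8]:
    print(f"\npositive_class_weight = {w}")
    m = train_model(10, positive_class_weight=w)
    evaluate(m)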
Hypothetically, the insurance premium for each policyholder could initially be set based on the probability of filing a claim, with a safety margin included. Then, we use the classification model to adjust the premium: if a policyholder is classified as positive, we increase their premium proportionally, based on the statistics of previously filed ClaimAmounts. A rough sketch of this rule is shown below.
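As a rough sketch of that idea (the base premium, safety margin, surcharge factor, and the quote_premium helper are hypothetical placeholders, not values or code from the original analysis):
# A hypothetical premium rule: base premium plus a safety margin, with a surcharge
# for policyholders the classifier flags as likely to file a claim.
# All constants and the helper itself are illustrative assumptions.
BASE_PREMIUM = 100.0   # hypothetical base premium per year
SAFETY_MARGIN = 0.10   # hypothetical safety margin
SURCHARGE = 0.50       # hypothetical surcharge for predicted claimants

def quote_premium(policy_row, model, preprocess):
    x = preprocess.transform(policy_row).toarray()
    x = torch.tensor(x, dtype=torch.float32).to(device)
    with torch.no_grad():
        p_claim = model(x).item()
    premium = BASE_PREMIUM * (1 + SAFETY_MARGIN)
    if p_claim > 0.5:  # classified as likely to claim
        premium *= 1 + SURCHARGE
    return premium

# Example: quote the first policyholder in X
# print(quote_premium(X.iloc[[0]], model, preprocess))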
This was a simple analysis of insurance premium modeling, intended to illustrate that machine learning models may not always meet our initial expectations. However, with a shift in perspective and a thorough analysis of the problem, we can still leverage their potential effectively.