Segmentation of customers based on their traits is a crucial business strategy for personalized promotions and offers. We attempted to incorporate segmentation via the K-Means clustering method, using the optimal number of clusters as derived from the PCA analysis and t-SNE. Figure 8 conveys that segmentation using K-Means clustering is not a suitable approach for this dataset. Visit the project repository for the code.
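For reference, a minimal sketch of the attempted approach, assuming df_num holds the numeric customer traits used elsewhere in the project (the cluster counts, component number, and silhouette scoring are illustrative, not the exact code from the repository):
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Scale the numeric traits, then project onto two principal components
X_scaled = StandardScaler().fit_transform(df_num)
X_pca = PCA(n_components=2, random_state=1).fit_transform(X_scaled)

# Score candidate cluster counts; uniformly low silhouette values would
# support the conclusion that K-Means segmentation does not fit this data
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_pca)
    print(f'k={k}: silhouette={silhouette_score(X_pca, labels):.3f}')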
We have completed the EDA portion of the project, and we now move on to the machine learning part of it.
Machine Learning Models and Optimization
Data Preparation
The core of this project is to develop a machine-learning model that best predicts the target variable for a new customer. It is essential to prepare the data for modeling, and the main preparation steps were:
- Standardizing identifiers: To avoid errors or warnings while modeling, we standardized the column names of the dataset by replacing spaces with underscores.
- Encoding the categorical variables: All the ML models require numerical data to process; therefore, we encoded all the object-type variables using the LabelEncoder() class.
- Splitting the dataset: After removing the target variable from the dataset, the data frame was split into train and test sets with a test size of 30%.
# Standardizing column names
df.columns = [x.lower() for x in df.columns]
df.columns = df.columns.str.replace(' ', '_')

# Categorical variable encoding
from sklearn.preprocessing import LabelEncoder
for col in df_cat:  # df_cat is the data frame of categorical variables
    le = LabelEncoder()
    le.fit(df_cat[col])
    df_cat[col] = le.transform(df_cat[col])

# Final dataset before splitting, combining categorical and numerical variables
df_final = pd.concat([df_num, df_cat], axis=1)

# Splitting the dataset
from sklearn.model_selection import train_test_split
X = df_final.drop(['customer_lifetime_value', 'policy_type', 'policy'], axis=1)
y = df_final['customer_lifetime_value']
y = np.log(y)  # log-transform the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
The data is ready for modeling!
Machine Learning Modeling
1. Linear Regression with Lasso (L1) and Ridge (L2) Regularizations
For Linear Regression, we applied the Lasso (L1) and Ridge (L2) regularizations to optimize the performance metrics of the model by avoiding over-fitting. Read more about regularization here.
# import libraries for regression and performance evaluation
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Lasso regression fitting and prediction
lasso = Lasso(random_state=1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

# Performance metrics for Lasso
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R-Squared:', r2_score(y_test, y_pred))
print(f'R^2 score for train: {lasso.score(X_train, y_train)}')
print(f'R^2 score for test: {lasso.score(X_test, y_test)}')

# Ridge regression fitting and prediction
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

# Performance metrics for Ridge
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R-Squared:', r2_score(y_test, y_pred))
print(f'R^2 score for train: {ridge.score(X_train, y_train)}')
print(f'R^2 score for test: {ridge.score(X_test, y_test)}')
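The regularization strength alpha controls how hard each penalty shrinks the coefficients. The post does not show how alpha was chosen; a quick sweep such as the sketch below is one common way to pick it (the candidate values are illustrative):
# Sweep candidate regularization strengths for both penalties
for alpha in [0.001, 0.01, 0.1, 1.0]:
    lasso_a = Lasso(alpha=alpha, random_state=1).fit(X_train, y_train)
    ridge_a = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f'alpha={alpha}: Lasso test R^2={lasso_a.score(X_test, y_test):.3f}, '
          f'Ridge test R^2={ridge_a.score(X_test, y_test):.3f}')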
Figure 9 gives the performance metrics of the Linear Regression model regularized by Lasso and Ridge. It has a low R-squared on both the train and test data sets.
2. Decision Tree Regressor
On modeling the data with the Decision Tree Regressor, we encounter an R-squared of 1.0 on the train data and only 0.84 on the test data, indicating over-fitting of the model on the train dataset.
# import the library from scikit-learn
from sklearn.tree import DecisionTreeRegressor

# Fitting and predicting with Decision Tree
dt = DecisionTreeRegressor(random_state=1)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

# Performance metrics for Decision Tree Regressor
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R-Squared:', r2_score(y_test, y_pred))
print(f'R^2 score for train: {dt.score(X_train, y_train)}')
print(f'R^2 score for test: {dt.score(X_test, y_test)}')
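A perfect train score alongside a much lower test score is the classic over-fitting signature. One common remedy, not applied in the original scripts, is to constrain the tree, for example by capping its depth (the value below is illustrative):
# Hypothetical depth cap: trades a little train accuracy for less variance
dt_pruned = DecisionTreeRegressor(max_depth=8, random_state=1)
dt_pruned.fit(X_train, y_train)
print(f'R^2 score for train: {dt_pruned.score(X_train, y_train)}')
print(f'R^2 score for test: {dt_pruned.score(X_test, y_test)}')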
3. Random Forest Regressor
With the Random Forest Regressor, we obtain a train data accuracy (R²) of 98% and a test data accuracy of 90%. Among all the models up to this point, the Random Forest Regressor proves to be the most reliable prediction model.
# import the library
from sklearn.ensemble import RandomForestRegressor

# Fitting and predicting using Random Forest Regressor
rf = RandomForestRegressor(n_estimators=10, random_state=1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Performance metrics for Random Forest Regressor
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R-Squared:', r2_score(y_test, y_pred))
print(f'R^2 score for train: {rf.score(X_train, y_train)}')
print(f'R^2 score for test: {rf.score(X_test, y_test)}')
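Since the target was log-transformed before splitting, the RMSE above is in log units. To express errors on the original CLV scale, the log-space predictions can be exponentiated back (a small sketch, not part of the original scripts):
# Convert log-space predictions back to the original CLV scale
y_pred_clv = np.exp(y_pred)
y_test_clv = np.exp(y_test)
print('RMSE (original scale):', np.sqrt(mean_squared_error(y_test_clv, y_pred_clv)))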
In addition to the above models, we implemented hyper-tuned Random Forest, AdaBoost, and Neural Network models. However, their test accuracies were lower than that of the Random Forest Regressor. Even though hyper-parameter tuning improved the accuracy of the Random Forest model, it demanded far more computation and was discarded. Figure 12 shows the summary of the performance metrics of all the models tested on the dataset.
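The exact search space used for the hyper-tuned Random Forest is not given in the post; a typical grid search over a few key parameters would look like the sketch below (parameter values are illustrative):
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid; the original search space is not shown
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 2, 5],
}
grid = GridSearchCV(RandomForestRegressor(random_state=1),
                    param_grid, cv=5, scoring='r2', n_jobs=-1)
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV R^2:', grid.best_score_)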
To conclude, we chose the Random Forest Regressor for predicting the Customer Lifetime Value of new customers.
User Interface for CLV Prediction
For Company X to understand customer behavior before issuing an insurance policy, we developed a website scripted in HTML and styled with CSS and JavaScript that gives an instant prediction of the Customer Lifetime Value of any potential client, taking into account the traits we used for the prediction model.
We converted the Random Forest prediction model into a .sav file and connected it to an HTML file using the Python libraries pickle (for loading and executing the .sav file) and flask (for rendering the webpage when running the model).
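A minimal sketch of how those pieces fit together (the file name, route, and form handling are illustrative assumptions; the actual scripts are in the repository):
import pickle
import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the serialized Random Forest model (file name is hypothetical)
with open('clv_model.sav', 'rb') as f:
    model = pickle.load(f)

@app.route('/', methods=['GET', 'POST'])
def predict():
    prediction = None
    if request.method == 'POST':
        # Read the 19 form fields in the same order as the training columns
        features = [float(value) for value in request.form.values()]
        # The model was trained on log(CLV), so exponentiate the prediction
        prediction = np.exp(model.predict([features])[0])
    return render_template('index.html', prediction=prediction)

if __name__ == '__main__':
    app.run(debug=True)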
Figure 13 is the website's landing page, which predicts the Customer Lifetime Value for any individual customer given all 19 attributes required to predict it. This gives Company X an upper hand in assessing the risk involved with each of their incoming clients, thus lowering the chances of loss. The website is not rendered publicly because of the charges involved in securing a deal with a web server. The scripts and files can be viewed in the GitHub Repository.
Q&A Interface for Data Retrieval via LLM
Further, we extended this project to implement a Large Language Model for extracting data from the database. We obtained the Python API keys for applying Google's Gemini LLM using the GooglePalm class from the langchain library. The LLM class object is used to convert human prompts into SQL queries.
The Python interface reads a data-related question in human language and passes it to the langchain object, which converts it into machine-readable form; the SQLDatabaseChain object then converts it into a SQL query, and the SQLDatabase framework connects the MySQL server to the Python application and executes the query, retrieving the desired data.
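Put together, the chain looks roughly like the sketch below (the connection string and API key are placeholders, and the package layout differs across langchain versions, so treat this as an assumption-laden outline rather than the exact project code):
from langchain.llms import GooglePalm
from langchain.utilities import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

# Connect to the MySQL database (credentials are placeholders)
db = SQLDatabase.from_uri('mysql+pymysql://user:password@localhost/clv_db')

# LLM object that translates natural-language prompts into SQL
llm = GooglePalm(google_api_key='YOUR_API_KEY', temperature=0)

# Chain: question -> SQL query -> execution -> answer
chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
print(chain.run('How many customers hold a premium auto policy?'))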
Thus, we can get any information (Figure 14) from the database based on the data attributes listed on the page for reference. Any question beyond the scope of the data will throw an error. All code and files are on GitHub.
Conclusion
This data science project, leveraged by Company X for the prediction of the Customer Lifetime Value of new customers, can achieve the following outcomes:
- Improved Customer Retention: Identify high-value customers for targeted promotional offers and loyalty programs, leading to increased customer retention and reduced churn rates.
- Enhanced Marketing Effectiveness: Enable data-driven allocation of marketing resources toward high-value customer segments, maximizing return on investment.
- Data-Driven Decision-Making: Empower stakeholders with CLV insights and interactive visualizations, facilitating informed decisions regarding customer acquisition, retention, and overall business strategy.
- Enhanced User Experience: Provide a user-friendly Q&A interface for easy access to information, promoting data democratization and knowledge sharing across the organization while reducing the time required to craft efficient SQL queries.
References
- Danao, M. (2023). What Is Customer Lifetime Value (CLV)? Forbes Advisor.
- LangChain, Inc. (2024a). LLMs.
- LangChain, Inc. (2024b). SQL Database.
- Bhattacharyya, S. (2018). Ridge and Lasso Regression: L1 and L2 Regularization. Towards Data Science.