We are going to use the Toyota Corolla CSV dataset for this machine learning regression analysis project. Regression analysis is a form of predictive modeling that investigates the relationship between a dependent (target) variable and independent variable(s) (predictors). It is used to predict continuous variables. For example, in this project we will be predicting the price of the car (Toyota Corolla) based on its ‘Age’, ‘KM’, ‘FuelType’, ‘HP’, ‘MetColor’, ‘Automatic’, ‘CC’, ‘Doors’, and ‘Weight’.
The data used can be found on GitHub.
This dataset is clean and doesn’t contain null values or unnecessary columns.
First, we’ll import the necessary modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
Pandas: For analyzing, cleaning, exploring, and manipulating data
NumPy: For working with arrays
Matplotlib and Seaborn: For visualization
Next, we’ll import the dataset:
df = pd.read_csv(r"C:\Users\JOY\Downloads\ToyotaCorolla.csv")
df.head() # displays the first 5 rows of the data
Exploratory Data Analysis
EDA helps us determine how best to manipulate data sources to get the answers we need, making it easier for data scientists to discover patterns, spot anomalies, test hypotheses, and check assumptions.
For exploratory data analysis (EDA), we’ll check for null values to confirm the dataset’s cleanliness:
df.isnull().sum() # there are 0 null values
Then, we’ll generate summary statistics to gain valuable insights into the dataset:
df.describe()
We will also perform exploratory analysis through visualization by examining each column’s distribution:
for column in ['Price', 'Age', 'KM', 'HP', 'MetColor', 'Automatic', 'CC', 'Doors', 'Weight']:
    plt.figure()
    df[column].hist()
    plt.title(column)
    plt.show()
Output:
We will also visualize the distribution of the categorical ‘FuelType’ column using a pie chart:
fuel_type_counts = df['FuelType'].value_counts()
plt.figure(figsize=(10,6))
plt.pie(fuel_type_counts, labels=fuel_type_counts.index, autopct='%1.1f%%')
plt.title('FuelType')
plt.show()
Encoding Categorical Data
We will encode the categorical data. Since most machine learning models only accept numerical variables, preprocessing the categorical variables becomes a crucial step. We will use one-hot encoding for the ‘FuelType’ column and then drop the original column and one dummy column:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, drop='first').set_output(transform="pandas")
cat_encoded = encoder.fit_transform(df[['FuelType']])
df1 = pd.concat([df, cat_encoded], axis=1)
df1.drop(['FuelType'], axis=1, inplace=True)
df1.head()
Encoding the categorical column yields three columns, one per fuel type. We used drop='first' because the dummy variables include redundant information: with n categories we only need n−1 dummy variables, so we drop one newly created column (dummy variable). This prevents the dummy variable trap.
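As an aside, the same n−1 encoding can be obtained with pandas alone; this is a minimal sketch, not part of the original walkthrough, and the df1_alt name is just illustrative:
# Equivalent one-hot encoding with pandas; drop_first=True keeps n-1 dummies
df1_alt = pd.get_dummies(df, columns=['FuelType'], drop_first=True)
df1_alt.head()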
Defining X and y and Splitting the Data
Now, let’s define our independent and dependent variables. The independent variable is the cause, while the dependent variable is the effect:
X = df1[['Age', 'KM', 'HP', 'MetColor', 'Automatic', 'CC', 'Doors', 'Weight', 'FuelType_Diesel', 'FuelType_Petrol']]
y = df1['Price'] # a 1-D Series avoids shape warnings when fitting
Linearity
We want to use one variable as a predictor or explanatory variable to explain the other variable, the response or dependent variable. To do this, we need a good relationship between our two variables. However, the independent variables should not be too highly correlated with one another, a condition known as multicollinearity. This can be checked using correlation matrices, where correlation coefficients should ideally be below 0.80, according to NCBI.
In essence, there should be some relationship between each predictor and the target, but not strong relationships among the predictors themselves (a complementary VIF check is sketched after the heatmap below):
correlation_matrix = df1[['Price', 'Age', 'KM', 'HP', 'MetColor', 'Automatic', 'CC', 'Doors', 'Weight', 'FuelType_Diesel', 'FuelType_Petrol']].corr()
print(correlation_matrix)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
Output:
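As a complementary check that is not in the original walkthrough, variance inflation factors (VIFs) quantify multicollinearity directly; a minimal sketch, assuming statsmodels is installed:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# A VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity
X_vif = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif)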
We will split our data into training and test sets so we can assess a machine learning model’s behavior on observations that are not used in the training process:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Modeling
We import, define, and fit our models. We will be using Linear Regression, Decision Tree, Support Vector Machine, and Random Forest regressors:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics

# Define our models
svm = SVR()
rf = RandomForestRegressor()
dt = DecisionTreeRegressor()
lr = LinearRegression()
# Fit the models
svm.fit(X_train, y_train)
rf.fit(X_train, y_train)
dt.fit(X_train, y_train)
lr.fit(X_train, y_train)
Then, we’ll predict using these models:
svm_predictions = svm.predict(X_test)
rf_predictions = rf.predict(X_test)
dt_predictions = dt.predict(X_test)
lr_predictions = lr.predict(X_test)
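To show how a fitted model would be used in practice, here is a minimal sketch that predicts the price of a single hypothetical car with the random forest model (the feature values are made up for illustration and are not from the original article):
# Hypothetical car: 24 months old, 40,000 KM, 110 HP, metallic color,
# manual transmission, 1600 CC, 5 doors, 1070 kg, petrol fuel
new_car = pd.DataFrame(
    [[24, 40000, 110, 1, 0, 1600, 5, 1070, 0, 1]],
    columns=X.columns,
)
print("Predicted price:", rf.predict(new_car)[0])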
Let’s visualize the predictions vs. actual values for easier understanding:
fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# Decision Tree predictions
axs[0, 0].scatter(y_test, dt_predictions, color='red')
axs[0, 0].set_title('Decision Tree Predictions')
axs[0, 0].set_xlabel('Actual values')
axs[0, 0].set_ylabel('Predicted values')
# Linear Regression predictions
axs[0, 1].scatter(y_test, lr_predictions, color='green')
axs[0, 1].set_title('Linear Regression Predictions')
axs[0, 1].set_xlabel('Actual values')
axs[0, 1].set_ylabel('Predicted values')
# Random Forest predictions
axs[1, 0].scatter(y_test, rf_predictions, color='blue')
axs[1, 0].set_title('Random Forest Predictions')
axs[1, 0].set_xlabel('Actual values')
axs[1, 0].set_ylabel('Predicted values')
# SVM predictions
axs[1, 1].scatter(y_test, svm_predictions, color='yellow')
axs[1, 1].set_title('SVM Predictions')
axs[1, 1].set_xlabel('Actual values')
axs[1, 1].set_ylabel('Predicted values')
plt.tight_layout()
plt.present()
Output:
Model Evaluation
Finally, we’ll evaluate the models:
We evaluate the performance of our models using the Root Mean Squared Error (RMSE), one of the two main performance indicators for a regression model (the other being R², sketched after the results below). RMSE measures the average difference between the values predicted by a model and the actual values, so it estimates how well the model can predict the target value; a lower RMSE indicates better model performance (GeeksforGeeks).
print("SVM RMSE:", metrics.mean_squared_error(y_test, svm_predictions, squared=False))
print("Random Forest RMSE:", metrics.mean_squared_error(y_test, rf_predictions, squared=False))
print("Determination Tree RMSE:", metrics.mean_squared_error(y_test, dt_predictions, squared=False))
Output:
SVM RMSE: 3753.8240069371495
Random Forest RMSE: 1266.9954324165392
Decision Tree RMSE: 1610.0079966040294
Linear Regression RMSE: 1422.6439130179815
The Random Forest model has the lowest RMSE, indicating it makes the most accurate predictions among the models we used. By combining multiple trees, random forest regression reduces the reliance on any one tree, leading to less overfitting on the training data, and averaging predictions from multiple trees generally results in more accurate predictions than a single decision tree. The SVM model, on the other hand, has the highest RMSE, indicating it makes the least accurate predictions.
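For completeness, a minimal sketch (not in the original article) computing R², the other common indicator, which reports the share of price variance each model explains:
# R^2 closer to 1 means the model explains more of the variance in Price
for name, preds in [("SVM", svm_predictions),
                    ("Random Forest", rf_predictions),
                    ("Decision Tree", dt_predictions),
                    ("Linear Regression", lr_predictions)]:
    print(name, "R^2:", metrics.r2_score(y_test, preds))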
The complete code is at https://github.com/JoyKimaiyo/Machine-Learning-Regression-Analysis-Portfolio-Project
Connect with me on GitHub: https://github.com/JoyKimaiyo