We are going to use the Toyota Corolla CSV dataset for this machine learning regression analysis project. Regression analysis is a form of predictive modeling that investigates the relationship between a dependent (target) variable and independent variable(s) (predictors). It is used to predict continuous variables. For example, in this project we will be predicting the price of the car (Toyota Corolla) based on its ‘Age’, ‘KM’, ‘FuelType’, ‘HP’, ‘MetColor’, ‘Automatic’, ‘CC’, ‘Doors’, and ‘Weight’.
The data used can be found on GitHub.
This dataset is clean and doesn’t contain null values or unnecessary columns.
First, we’ll import the necessary modules:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
Pandas: For analyzing, cleaning, exploring, and manipulating data
NumPy: For working with arrays
Matplotlib and Seaborn: For visualization
Next, we’ll import the dataset:
df = pd.read_csv(r"C:\Users\JOY\Downloads\ToyotaCorolla.csv")
df.head() # displays the first 5 rows of the data
Exploratory Data Analysis
EDA helps us determine how best to manipulate data sources to get the answers we need, making it easier for data scientists to discover patterns, spot anomalies, test hypotheses, and check assumptions.
For exploratory data analysis (EDA), we’ll check for null values to confirm the dataset’s cleanliness:
df.isnull().sum() # there are 0 null values
Then, we’ll generate summary statistics to gain valuable insights into the dataset:
df.describe()
We will also perform exploratory analysis through visualization by examining each column’s distribution:
for column in ['Price', 'Age', 'KM', 'HP', 'MetColor', 'Automatic', 'CC', 'Doors', 'Weight']:
    plt.figure()
    df[column].hist()
    plt.title(column)
    plt.show()
Output:
We will also visualize the distribution of the categorical ‘FuelType’ column using a pie chart:
fuel_type_counts = df['FuelType'].value_counts()
plt.figure(figsize=(10,6))
plt.pie(fuel_type_counts, labels=fuel_type_counts.index, autopct='%1.1f%%')
plt.title('FuelType')
plt.show()
Encoding Categorical Data
We will encode the categorical data. Since most machine learning models only accept numerical variables, preprocessing the categorical variables becomes a crucial step. We will use one-hot encoding for the ‘FuelType’ column and then drop the original column and one dummy column:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, drop='first').set_output(transform="pandas")
cat_encoded = encoder.fit_transform(df[['FuelType']])
df1 = pd.concat([df, cat_encoded], axis=1)
df1.drop(['FuelType'], axis=1, inplace=True)
df1.head()
Encoding the categorical column yields three columns, one per fuel type. We used drop='first' because the dummy variables include redundant information: with n categories we only need n−1 dummy variables, so we drop one newly created column (dummy variable). This prevents the dummy variable trap.
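As an aside, the same n−1 encoding can be obtained with pandas alone; this is a minimal sketch, not part of the original walkthrough, and the df1_alt name is just illustrative:
# Equivalent one-hot encoding with pandas; drop_first=True keeps n-1 dummies
df1_alt = pd.get_dummies(df, columns=['FuelType'], drop_first=True)
df1_alt.head()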
Defining X and y and Splitting the Data
Now, let’s define our independent and dependent variables. The independent variable is the cause, while the dependent variable is the effect:
X = df1[['Age', 'KM', 'HP', 'MetColor', 'Automatic', 'CC', 'Doors', 'Weight', 'FuelType_Diesel', 'FuelType_Petrol']]
y = df1['Price'] # a 1-D Series avoids shape warnings when fitting
Linearity
We want to use one variable as a predictor or explanatory variable to explain the other variable, the response or dependent variable. To do this, we need a good relationship between our two variables. However, the independent variables should not be too highly correlated with one another, a condition known as multicollinearity. This can be checked using correlation matrices, where correlation coefficients should ideally be below 0.80, according to NCBI.
In essence, there should be some relationship between each predictor and the target, but not strong relationships among the predictors themselves (a complementary VIF check is sketched after the heatmap below):
correlation_matrix = df1[['Price', 'Age', 'KM', 'HP', 'MetColor', 'Automatic', 'CC', 'Doors', 'Weight', 'FuelType_Diesel', 'FuelType_Petrol']].corr()
print(correlation_matrix)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
Output:
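As a complementary check that is not in the original walkthrough, variance inflation factors (VIFs) quantify multicollinearity directly; a minimal sketch, assuming statsmodels is installed:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# A VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity
X_vif = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif)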
We will split our data into training and test sets so we can assess a machine learning model’s behavior on observations that are not used in the training process:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Modeling
We import, define, and fit our models. We will be using Linear Regression, Decision Tree, Support Vector Machine, and Random Forest regressors:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics

# Define our models
svm = SVR()
rf = RandomForestRegressor()
dt = DecisionTreeRegressor()
lr = LinearRegression()
# Fit the models
svm.fit(X_train, y_train)
rf.fit(X_train, y_train)
dt.fit(X_train, y_train)
lr.fit(X_train, y_train)
Then, we’ll predict using these models:
svm_predictions = svm.predict(X_test)
rf_predictions = rf.predict(X_test)
dt_predictions = dt.predict(X_test)
lr_predictions = lr.predict(X_test)
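To show how a fitted model would be used in practice, here is a minimal sketch that predicts the price of a single hypothetical car with the random forest model (the feature values are made up for illustration and are not from the original article):
# Hypothetical car: 24 months old, 40,000 KM, 110 HP, metallic color,
# manual transmission, 1600 CC, 5 doors, 1070 kg, petrol fuel
new_car = pd.DataFrame(
    [[24, 40000, 110, 1, 0, 1600, 5, 1070, 0, 1]],
    columns=X.columns,
)
print("Predicted price:", rf.predict(new_car)[0])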
Let’s visualize the predictions vs. actual values for easier understanding:
fig, axs = plt.subplots(2, 2, figsize=(15, 10))

# Decision Tree predictions
axs[0, 0].scatter(y_test, dt_predictions, color='red')
axs[0, 0].set_title('Decision Tree Predictions')
axs[0, 0].set_xlabel('Actual values')
axs[0, 0].set_ylabel('Predicted values')
# Linear Regression predictions
axs[0, 1].scatter(y_test, lr_predictions, color='green')
axs[0, 1].set_title('Linear Regression Predictions')
axs[0, 1].set_xlabel('Actual values')
axs[0, 1].set_ylabel('Predicted values')
# Random Forest predictions
axs[1, 0].scatter(y_test, rf_predictions, color='blue')
axs[1, 0].set_title('Random Forest Predictions')
axs[1, 0].set_xlabel('Actual values')
axs[1, 0].set_ylabel('Predicted values')
# SVM predictions
axs[1, 1].scatter(y_test, svm_predictions, color='yellow')
axs[1, 1].set_title('SVM Predictions')
axs[1, 1].set_xlabel('Actual values')
axs[1, 1].set_ylabel('Predicted values')
plt.tight_layout()
plt.present()
Output:
Model Evaluation
Finally, we’ll evaluate the models:
We evaluate the performance of our models using the Root Mean Squared Error (RMSE), one of the two main performance indicators for a regression model (the other being R², sketched after the results below). RMSE measures the average difference between the values predicted by a model and the actual values, so it estimates how well the model can predict the target value; a lower RMSE indicates better model performance (GeeksforGeeks).
print("SVM RMSE:", metrics.mean_squared_error(y_test, svm_predictions, squared=False))
print("Random Forest RMSE:", metrics.mean_squared_error(y_test, rf_predictions, squared=False))
print("Determination Tree RMSE:", metrics.mean_squared_error(y_test, dt_predictions, squared=False))
Output:
SVM RMSE: 3753.8240069371495
Random Forest RMSE: 1266.9954324165392
Decision Tree RMSE: 1610.0079966040294
Linear Regression RMSE: 1422.6439130179815
The Random Forest model has the lowest RMSE, indicating it makes the most accurate predictions among the models we used. By combining multiple trees, random forest regression reduces the reliance on any one tree, leading to less overfitting on the training data, and averaging predictions from multiple trees generally results in more accurate predictions than a single decision tree. The SVM model, on the other hand, has the highest RMSE, indicating it makes the least accurate predictions.
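For completeness, a minimal sketch (not in the original article) computing R², the other common indicator, which reports the share of price variance each model explains:
# R^2 closer to 1 means the model explains more of the variance in Price
for name, preds in [("SVM", svm_predictions),
                    ("Random Forest", rf_predictions),
                    ("Decision Tree", dt_predictions),
                    ("Linear Regression", lr_predictions)]:
    print(name, "R^2:", metrics.r2_score(y_test, preds))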
The complete code is at https://github.com/JoyKimaiyo/Machine-Learning-Regression-Analysis-Portfolio-Project
Connect with me on GitHub: https://github.com/JoyKimaiyo