In the field of machine learning, one of the crucial steps in building a predictive model is preprocessing the data. This process involves handling missing values, encoding categorical variables, and splitting the data into training and testing sets. In this article, we will explore how to preprocess a dataset containing information about car sales using Python's Pandas and Scikit-learn libraries.
We start by importing the necessary libraries and loading the dataset into a Pandas DataFrame.
import pandas as pd
# Import the dataset
file = pd.read_csv("../datasets/Car_sales_missing.csv")
file = file.drop("Latest_Launch", axis=1)
file
Next, we check for missing values in the dataset and fill them using appropriate strategies. We use the mean value for numerical columns and a placeholder for categorical columns.
# Check for missing values
file.isna().sum()
# Fill missing values; assigning the result back is more reliable than
# calling fillna(..., inplace=True) on a selected column in recent pandas versions
file["Vehicle_type"] = file["Vehicle_type"].fillna("missing")
file["__year_resale_value"] = file["__year_resale_value"].fillna(file["__year_resale_value"].mean())
file["Sales_in_thousands"] = file["Sales_in_thousands"].fillna(file["Sales_in_thousands"].mean())
file["Price_in_thousands"] = file["Price_in_thousands"].fillna(file["Price_in_thousands"].mean())
file["Engine_size"] = file["Engine_size"].fillna(file["Engine_size"].mean())
file["Horsepower"] = file["Horsepower"].fillna(file["Horsepower"].mean())
file["Wheelbase"] = file["Wheelbase"].fillna(file["Wheelbase"].mean())
file["Width"] = file["Width"].fillna(file["Width"].mean())
file["Length"] = file["Length"].fillna(file["Length"].mean())
file["Curb_weight"] = file["Curb_weight"].fillna(file["Curb_weight"].mean())
file["Fuel_capacity"] = file["Fuel_capacity"].fillna(file["Fuel_capacity"].mean())
file["Fuel_efficiency"] = file["Fuel_efficiency"].fillna(file["Fuel_efficiency"].mean())
file["Power_perf_factor"] = file["Power_perf_factor"].fillna(file["Power_perf_factor"].mean())
# Re-check the missing-value counts
file.isna().sum()
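The column-by-column fills above can also be written more compactly. The following is a minimal sketch on a made-up toy DataFrame (the values and the `toy` name are illustrative assumptions, not the car sales data): it fills every numeric column with its own mean in one step and then fills the categorical column with a placeholder.

```python
import pandas as pd

# Toy stand-in for the car sales data (values are invented for illustration).
toy = pd.DataFrame({
    "Vehicle_type": ["Passenger", None, "Car"],
    "Horsepower": [140.0, None, 200.0],
    "Width": [67.3, 70.3, None],
})

# Select all numeric columns and fill each one with its own column mean.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Fill the categorical column with a placeholder label.
toy["Vehicle_type"] = toy["Vehicle_type"].fillna("missing")

# Total number of missing values left in the frame.
print(toy.isna().sum().sum())
```

This form avoids repeating each column name, at the cost of being less explicit about which columns are touched.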
After handling missing values, we transform the categorical variables into numerical representations using one-hot encoding.
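from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# Define categorical features for one-hot encoding
categorical_features = ["Manufacturer", "Vehicle_type"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)],
                                remainder="passthrough")
# NOTE: the article never defines x and y. The split below is an assumption
# (Price_in_thousands as the target); substitute the column you want to predict.
x = file.drop("Price_in_thousands", axis=1)
y = file["Price_in_thousands"]
# Transform the data
transformed_x = transformer.fit_transform(x)
transformed_x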
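To make the effect of one-hot encoding concrete, here is a small sketch on an invented two-column frame (`toy_x` and its values are assumptions for illustration). Each distinct manufacturer becomes its own 0/1 column, and the numeric column passes through unchanged after the encoded columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Invented example data: one categorical and one numeric column.
toy_x = pd.DataFrame({
    "Manufacturer": ["Acura", "Audi", "Acura"],
    "Horsepower": [140.0, 150.0, 225.0],
})

toy_transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Manufacturer"])],
    remainder="passthrough",  # keep the non-encoded columns as they are
)
encoded = toy_transformer.fit_transform(toy_x)

# The result may be a sparse matrix or a dense array depending on its density;
# convert to a dense array so it can be inspected uniformly.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(encoded.shape)
print(encoded)
```

The first two output columns correspond to the two manufacturers (categories are ordered alphabetically), and the third is the original `Horsepower` value.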
Finally, we split the transformed data into training and testing sets to prepare for model training.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(transformed_x, y, test_size=0.2)
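Because `train_test_split` shuffles the rows randomly, each run produces a different split unless a seed is fixed. The sketch below uses synthetic arrays (not the car sales data) and an arbitrary `random_state` value to show that fixing the seed makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 features.
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Repeating the call with the same seed yields the same held-out rows.
_, x_te_again, _, y_te_again = train_test_split(
    features, target, test_size=0.2, random_state=42
)

print(len(x_tr), len(x_te))
print(np.array_equal(y_te, y_te_again))
```

The seed value itself carries no meaning; any fixed integer gives a reproducible split.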
We then use a random forest regressor to build a predictive model and evaluate its performance.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)
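For scikit-learn regressors, the `score` method returns the coefficient of determination (R²) on the data it is given. The following sketch, which uses synthetic data and arbitrary seeds rather than the car sales dataset, shows that it matches `r2_score` computed from the model's predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: a noisy linear function of two of three features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(X_tr, y_tr)

# score() on a regressor is R^2 computed on the supplied test data.
r2_from_score = demo_model.score(X_te, y_te)
r2_manual = r2_score(y_te, demo_model.predict(X_te))

print(r2_from_score, r2_manual)
```

A value close to 1 means the model explains most of the variance in the held-out target; a value near 0 means it does no better than predicting the mean.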
In this article, we discussed the importance of data preprocessing in machine learning and demonstrated how to handle missing values, encode categorical variables, and split the data for training and testing. These preprocessing steps are essential for building accurate and reliable machine learning models.