How do I prepare data for the Machine Learning Model? Make it all numbers! | by Machine Learning Maverick | Jun, 2024

In this article, I'll go deeper into the step "The data" from the Machine Learning Workflow. In the previous article, How do I work with data using a Machine Learning Model?, I described three of the six steps.

All the data for the Machine Learning Model must be numerical. Preparing the data involves filling in missing values and changing all non-numerical values into numbers, e.g. text into categories or integers, string dates split into days, months, and years as integers, and boolean yes/no as 0 and 1.

In the previous articles, I was working with perfect data where no preparation was needed. In the real world, perfect data doesn't exist; you'll always have to work on the data. This step, known as Exploratory Data Analysis (EDA), is an important task to conduct at the beginning of every data science project.

1. Loading the data
2. Dealing with missing values
   1. Identify missing values
   2. Filling missing values
      1. Numeric values
      2. Non-numeric values
3. Convert non-numeric data into numeric
   1. Text into numbers
   2. Dates into numbers
   3. Categories into numbers
4. The source code

We need to load the data for Exploratory Data Analysis (EDA). The data set used for this article is "Apartment Prices in Poland": https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.

# Importing the tools
import pandas as pd

# Load the data into a pandas DataFrame
data_frame = pd.read_csv("apartments_rent_pl_2024_01.csv")
# Let's see what data we have
data_frame.head()

The first 5 rows from the loaded data set

As we can see, even in the first 5 rows we have missing values, in the form of empty cells.

First, we need to identify the data types for each column in the loaded data set. We need numeric values; any data type other than object is good.

# Checking column data types to know how to handle missing values
data_frame.dtypes

Below we have a list of column names and their data types, e.g. column id is of type object.

id                       object
city                     object
type                     object
squareMeters            float64
rooms                   float64
floor                   float64
floorCount              float64
buildYear               float64
latitude                float64
longitude               float64
centreDistance          float64
poiCount                float64
schoolDistance          float64
clinicDistance          float64
postOfficeDistance      float64
kindergartenDistance    float64
restaurantDistance      float64
collegeDistance         float64
pharmacyDistance        float64
ownership                object
buildingMaterial         object
condition                object
hasParkingSpace           int64
hasBalcony                int64
hasElevator               int64
hasSecurity               int64
hasStorageRoom            int64
price                     int64
dtype: object
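
Instead of reading the dtypes listing by eye, the columns that still need converting can be selected programmatically. This is a minimal sketch on a small made-up DataFrame (the sample values are assumptions, not rows from the data set); select_dtypes() is a standard pandas method.

```python
import pandas as pd

# Toy frame standing in for the real data set (values are made up)
toy = pd.DataFrame({
    "city": ["krakow", "gdynia"],
    "squareMeters": [45.0, 60.5],
    "hasBalcony": ["yes", "no"],
    "price": [2500, 3100],
})

# Columns that are not numeric still need converting before training
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

On the real data_frame the same call should list the text columns such as city, type, and condition.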

Before we even start filling missing values, we need to know which columns have missing values and how many.

# Checking data types vs NaN values - before and after filling missing data
info_df = pd.DataFrame({
    'Data Type': data_frame.dtypes,
    'Missing Values': data_frame.isna().sum()
})
print(info_df)

Below we have a list of columns with the Data Type and the number of Missing Values for each column.

                     Data Type  Missing Values
id                      object               0
city                    object               0
type                    object            2203
squareMeters           float64               0
rooms                  float64               0
floor                  float64            1030
floorCount             float64             171
buildYear              float64            2492
latitude               float64               0
longitude              float64               0
centreDistance         float64               0
poiCount               float64               0
schoolDistance         float64               2
clinicDistance         float64               5
postOfficeDistance     float64               5
kindergartenDistance   float64               7
restaurantDistance     float64              24
collegeDistance        float64             104
pharmacyDistance       float64              13
ownership               object               0
buildingMaterial        object            3459
condition               object            6223
hasParkingSpace         object               0
hasBalcony              object               0
hasElevator             object             454
hasSecurity             object               0
hasStorageRoom          object               0
price                    int64               0

As we can see, we have many missing values, e.g. column buildYear has 2492 missing values.

When we try to create a Machine Learning Model based on the DataFrame …
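
Raw counts are easier to judge next to the share of rows they represent. The sketch below, on a made-up toy frame (the values are assumptions for illustration), ranks columns by the percentage of missing cells.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are made up)
toy = pd.DataFrame({
    "floor": [1.0, np.nan, 3.0, np.nan],
    "condition": ["premium", None, None, None],
    "price": [2500, 3100, 2800, 2900],
})

# isna().mean() gives the fraction of missing cells per column
missing_share = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty may call for a different treatment than one with only a handful of gaps.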

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Setup random seed before splitting so the split is reproducible
np.random.seed(42)
X = data_frame.drop("price", axis=1)
y = data_frame["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier()
model.fit(X_train, y_train)

… we will get an exception.

ValueError                                Traceback (most recent call last)
<ipython-input-15-345ee3a9038d> in <cell line: 17>()
     15 model = RandomForestClassifier()
     16 # 'fit()' - Build a forest of trees from the training set (X, y).
---> 17 model.fit(X_train, y_train)
     18 # 'predict()' - Predict class for X.
     19 y_preds = model.predict(X_test)

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in __array__(self, dtype)
   1996     def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
   1997         values = self._values
-> 1998         arr = np.asarray(values, dtype=dtype)
   1999         if (
   2000             astype_is_view(values.dtype, arr.dtype)

ValueError: could not convert string to float: '1e1ec12d582075085f740f5c7bdf4091'

Before we create a Machine Learning Model, we need to fill in missing values, even in the numeric columns.

Numeric values

Filling missing values in numeric columns with the mean() isn't the best idea, but as a starting point, it's good enough.

For this task, I've used two methods, fillna() and mean(), on specific columns, e.g. column floor, data_frame["floor"], and used the parameter inplace=True to avoid reassigning the value to the column.

# Dealing with missing values
# Filling NaN values
data_frame["floor"].fillna(data_frame["floor"].mean(), inplace=True)
data_frame["floorCount"].fillna(data_frame["floorCount"].mean(), inplace=True)
data_frame["buildYear"].fillna(data_frame["buildYear"].mean(), inplace=True)

# Without parameter inplace=True
# data_frame["buildYear"] = data_frame["buildYear"].fillna(data_frame["buildYear"].mean())
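
The per-column calls can also be collapsed into a single reassignment over every numeric column. This is a sketch on a made-up toy frame, not the code used for the article. It uses the reassignment form because calling fillna(..., inplace=True) on a selected column may raise warnings in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in numeric and text columns (values are made up)
toy = pd.DataFrame({
    "floor": [1.0, np.nan, 3.0],
    "buildYear": [1990.0, 2010.0, np.nan],
    "city": ["krakow", None, "gdynia"],
})

numeric_cols = toy.select_dtypes(include="number").columns
# fillna() with a Series fills each column with the value stored under its own label
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
print(toy)
```

The text column city is left untouched here, since a mean makes no sense for it.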

Non-numeric values

When we deal with non-numeric values, the worst thing we can do is to fill in all the missing values with the same value. What do I mean by that?

First, I check the unique values for a specific column.

# Checking non-numeric columns' unique data to fill NaN
print(f"Condition: {data_frame['condition'].unique()}")

Condition: ['premium' 'low']

We don't want all our apartments to be only premium or only low. Filling missing values with a single value is a very bad idea.

That's why I use the code below to find the unique values of a specific column and then randomly apply one of them to each missing cell in that column.

unique_conditions = data_frame["condition"].dropna().unique()
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x)
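
A variation on the random fill above is to draw replacements from the observed (non-missing) entries rather than from the list of unique values, so that common categories are picked more often than rare ones. This is a different technique from the uniform choice used in the article, sketched here under that assumption; the helper name fill_with_observed_values is hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_observed_values(series, rng):
    """Replace NaN entries with values drawn at random from the non-missing entries."""
    missing = series.isna()
    observed = series[~missing].to_numpy()
    filled = series.copy()
    # One draw per missing cell; frequent values are drawn more often
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

rng = np.random.default_rng(42)  # seeded so the fill is reproducible
conditions = pd.Series(["premium", None, "low", "low", None])
filled = fill_with_observed_values(conditions, rng)
print(filled.tolist())
```

Either way, the filled values are guesses, so it is worth keeping track of which rows were imputed.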

The same can be applied to other columns, e.g. city, using its own values.

Cities: ['szczecin' 'gdynia' 'krakow' 'poznan' 'bialystok' 'gdansk' 'wroclaw' 'radom' 'rzeszow' 'lodz' 'katowice' 'lublin' 'czestochowa' 'warszawa' 'bydgoszcz']

Since we have filled in all missing values, we can start converting the data into numbers, because ALL the data for the Machine Learning Model must be numerical.

Sometimes it's easy to change text into numbers. In the data set we use, the id column contains text such as 2a1a6db97ff122d6bc148abb6f0e498a; in this case, we can read it as a hexadecimal number and convert it into an integer. The same goes for boolean values like yes/no: we can convert them into 1 and 0.

# Convert non-numeric data into numeric
# id column type 'str' into 'int'
data_frame["id"] = data_frame["id"].apply(
    lambda x: int(x, 16) if isinstance(x, str) else x)
# columns with 'str' yes/no into 1/0
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map({'yes': 1, "no": 0})
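
One thing to watch with map() and a dictionary: any value that is not a key in the dictionary silently becomes NaN. The sketch below, on made-up values (the "maybe" entry is an assumption added to trigger the effect), applies both conversions and then counts what failed to map.

```python
import pandas as pd

# Toy frame (values are made up; "maybe" is a deliberately unexpected entry)
toy = pd.DataFrame({
    "id": ["ff", "1a"],
    "hasBalcony": ["yes", "no"],
    "hasElevator": ["yes", "maybe"],
})

# Hexadecimal text into an integer
toy["id"] = toy["id"].apply(lambda x: int(x, 16))

# yes/no into 1/0; values missing from the mapping become NaN
yes_no = {"yes": 1, "no": 0}
for column in ["hasBalcony", "hasElevator"]:
    toy[column] = toy[column].map(yes_no)

# Count the cells that did not match any key in the mapping
unmapped = toy[["hasBalcony", "hasElevator"]].isna().sum()
print(unmapped)
```

Checking the unmapped count after the conversion helps catch typos or unexpected categories before they reach the model.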

Dates are often stored in text form too, e.g. 2024-06-10. We need to split each part, the year, the month, and the day, into separate variables/columns. The column used here is in a different data set, city_rentals_wro_2007_2023.csv, from the same "Apartment Prices in Poland": https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.

# Convert non-numeric data into numeric
# changing column 'date_listed' of type 'str' into separate numbers
data_frame['date_listed'] = pd.to_datetime(data_frame['date_listed'])

# create new columns for year, month, and day
data_frame['year'] = data_frame['date_listed'].dt.year
data_frame['month'] = data_frame['date_listed'].dt.month
data_frame['day'] = data_frame['date_listed'].dt.day
# drop the original 'date_listed' column if desired
data_frame = data_frame.drop('date_listed', axis=1)
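
If some date strings turn out to be malformed, pd.to_datetime() raises an error by default. A possible variation, sketched on made-up values (the "unknown" entry is an assumption for illustration), passes an explicit format and errors="coerce" so that unparseable text becomes NaT instead.

```python
import pandas as pd

# Toy column with one unparseable entry (values are made up)
toy = pd.DataFrame({"date_listed": ["2024-06-10", "2023-12-01", "unknown"]})

# errors="coerce" turns unparseable text into NaT instead of raising
parsed = pd.to_datetime(toy["date_listed"], format="%Y-%m-%d", errors="coerce")

toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month
toy["day"] = parsed.dt.day
toy = toy.drop("date_listed", axis=1)
print(toy)
```

Rows that failed to parse end up with NaN in the new columns, so they still need filling like any other missing numeric value.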

[Image: columns added after converting the column 'date_listed']

The above table shows the columns added after converting the column date_listed.

Text data can be turned into categories and then into numbers; the code below does it very well. I'm not going into many details. I use existing classes, OneHotEncoder and ColumnTransformer, both available in scikit-learn.

How did I decide which column may be treated as a category? It's related to the process described earlier under "Filling missing values", "Non-numeric values", and it's a part of Exploratory Data Analysis (EDA).


from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot,
                                  categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)

Before transforming the DataFrame we had 28 columns; now we have 44 columns without human-readable column names. Instead, we have only numbers as column names.

ALL the data is numerical. We finished EDA and ended up with a DataFrame ready to be used in the Machine Learning Model.

Below we can find all the source code necessary for preparing the data for use with a Machine Learning Model.

Steps covered:

1. Loading the data
2. Dealing with missing values
3. Identify missing values
4. Filling missing values
◦ Numeric values
◦ Non-numeric values
5. Convert non-numeric data into numeric
6. Text into numbers
7. Dates into numbers
8. Categories into numbers

# Importing the tools
import pandas as pd
import numpy as np

# Same file as loaded at the start of the article
csv_file_name = "apartments_rent_pl_2024_01.csv"
data_frame = pd.read_csv(csv_file_name)
# Dealing with missing values
# Filling NaN values
data_frame["floor"].fillna(data_frame["floor"].mean(), inplace=True)
data_frame["floorCount"].fillna(data_frame["floorCount"].mean(), inplace=True)
data_frame["buildYear"].fillna(data_frame["buildYear"].mean(), inplace=True)
data_frame["schoolDistance"].fillna(data_frame["schoolDistance"].mean(), inplace=True)
data_frame["clinicDistance"].fillna(data_frame["clinicDistance"].mean(), inplace=True)
data_frame["postOfficeDistance"].fillna(data_frame["postOfficeDistance"].mean(), inplace=True)
data_frame["kindergartenDistance"].fillna(data_frame["kindergartenDistance"].mean(), inplace=True)
data_frame["restaurantDistance"].fillna(data_frame["restaurantDistance"].mean(), inplace=True)
data_frame["collegeDistance"].fillna(data_frame["collegeDistance"].mean(), inplace=True)
data_frame["pharmacyDistance"].fillna(data_frame["pharmacyDistance"].mean(), inplace=True)
unique_types = data_frame["type"].dropna().unique()
data_frame["type"] = data_frame["type"].apply(lambda x: np.random.choice(unique_types) if pd.isna(x) else x)
data_frame["ownership"].fillna("condominium", inplace=True)
unique_bms = data_frame["buildingMaterial"].dropna().unique()
data_frame["buildingMaterial"] = data_frame["buildingMaterial"].apply(
    lambda x: np.random.choice(unique_bms) if pd.isna(x) else x)
unique_conditions = data_frame["condition"].dropna().unique()
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x)
unique_hes = data_frame["hasElevator"].dropna().unique()
data_frame["hasElevator"] = data_frame["hasElevator"].apply(
    lambda x: np.random.choice(unique_hes) if pd.isna(x) else x)
# Convert non-numeric data into numeric
# id column type 'str' into 'int'
data_frame["id"] = data_frame["id"].apply(lambda x: int(x, 16) if isinstance(x, str) else x)
# columns with 'str' yes/no into 1/0
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map({'yes': 1, "no": 0})
data_frame['hasBalcony'] = data_frame['hasBalcony'].map({'yes': 1, "no": 0})
data_frame['hasElevator'] = data_frame['hasElevator'].map({'yes': 1, "no": 0})
data_frame['hasSecurity'] = data_frame['hasSecurity'].map({'yes': 1, "no": 0})
data_frame['hasStorageRoom'] = data_frame['hasStorageRoom'].map({'yes': 1, "no": 0})
# X - training input samples, features
X = data_frame.drop("price", axis=1)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot,
                                  categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
transformed_df.to_csv("saved_transformed_df.csv")
# y - training input labels, the desired result, the target value
y = data_frame["price"]

Below we can find all the source code necessary for creating a Machine Learning Model based on the prepared data.

# Import 'train_test_split()' function
# "Split arrays or matrices into random train and test subsets."
from sklearn.model_selection import train_test_split

# Setup random seed - to have the same results, me and you
# (set before splitting so the split itself is reproducible)
np.random.seed(42)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
# Import the LinearRegression estimator class
from sklearn.linear_model import LinearRegression
# Instantiate LinearRegression to create a Machine Learning Model
model = LinearRegression()
# 'fit()' - Fit the linear model to the training set (X, y).
model.fit(X_train, y_train)
# 'predict()' - Predict target values for X.
y_preds = model.predict(X_test)
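
Once predictions exist, they can be compared with the held-out targets. The following is a sketch on synthetic data (the arrays X_demo and y_demo are made up, not the apartment data), using score() and mean_absolute_error() from scikit-learn. It also passes random_state to train_test_split(), which is another way to pin the split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, nearly linear data standing in for the prepared features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

# R^2 on the test set and the average absolute error of the predictions
r2 = model.score(X_test, y_test)
mae = mean_absolute_error(y_test, y_preds)
print(f"R^2: {r2:.3f}, MAE: {mae:.3f}")
```

On the real apartment data the scores will depend heavily on how the missing values were filled, so they are worth re-checking after every change to the preparation steps.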

NOTE: In this article, I'm just barely scratching the surface. This topic needs more reading and research on your own. I'm still at the beginning of my learning process of AI & ML!

Picture generated with Midjourney, edited in GIMP. Screenshots made by writer.
