In this article, I'll go deeper into the step "The data" from the Machine Learning Workflow. In the previous article, How do I work with data using a Machine Learning Model?, I described three of the six steps.
All the data for the Machine Learning Model must be numerical. Preparing the data involves filling in missing values and changing all non-numerical values into numbers, e.g. text into categories or integers, string dates split into days, months, and years as integers, and boolean yes/no into 0 and 1.
In the previous articles, I was working with perfect data where no preparation was needed. In the real world perfect data doesn't exist; you'll always have to work on the data first. This step is called Exploratory Data Analysis (EDA), and it is an important task to conduct at the beginning of every data science project.
1. Loading the data
2. Dealing with missing values
   1. Identify missing values
   2. Filling missing values
      1. Numeric values
      2. Non-numeric values
3. Convert non-numeric data into numeric
   1. Text into numbers
   2. Dates into numbers
   3. Categories into numbers
4. The source code
We need to load the data for Exploratory Data Analysis (EDA). The data set used for this article is "Apartment Prices in Poland": https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.
# Importing the tools
import pandas as pd

# Load the data into a pandas DataFrame
data_frame = pd.read_csv("apartments_rent_pl_2024_01.csv")
# Let's check what data we have
data_frame.head()
As we can see, even in the first 5 rows we have missing values, in the form of empty cells.
First, we need to identify the data type of each column in the loaded data set. We need numeric values, so any data type other than object is good.
# Checking column data types to know how to handle missing values
data_frame.dtypes
Below we have a list of column names and their data types, e.g. column id is of type object.
id                    object
city                  object
type                  object
squareMeters          float64
rooms                 float64
floor                 float64
floorCount            float64
buildYear             float64
latitude              float64
longitude             float64
centreDistance        float64
poiCount              float64
schoolDistance        float64
clinicDistance        float64
postOfficeDistance    float64
kindergartenDistance  float64
restaurantDistance    float64
collegeDistance       float64
pharmacyDistance      float64
ownership             object
buildingMaterial      object
condition             object
hasParkingSpace       int64
hasBalcony            int64
hasElevator           int64
hasSecurity           int64
hasStorageRoom        int64
price                 int64
dtype: object
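Instead of reading the whole dtypes output by eye, the columns that still need conversion can be listed with select_dtypes. The snippet below is only a minimal sketch: the DataFrame named sample and its values are made up for illustration, only the column names come from the data set.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration only
sample = pd.DataFrame({
    "city": ["krakow", "gdansk", "lodz"],
    "squareMeters": [45.0, 60.5, 38.0],
    "hasBalcony": ["yes", "no", "yes"],
    "price": [2500, 3100, 1900],
})

# Columns that are not numeric yet and still need to be converted
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()
# Columns that are already numeric
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

print("Non-numeric:", non_numeric_columns)
print("Numeric:", numeric_columns)
```

Run against the real data_frame, the same two calls would give the list of columns to convert and the list that only needs missing values filled.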
Before we even start filling missing values, we need to know which columns have missing values and how many.
# Checking data types vs NaN values - before and after filling missing data
info_df = pd.DataFrame({
    'Data Type': data_frame.dtypes,
    'Missing Values': data_frame.isna().sum()
})
print(info_df)
Below we have a list of columns with the Data Type and the number of Missing Values for each column.
                      Data Type  Missing Values
id                    object     0
city                  object     0
type                  object     2203
squareMeters          float64    0
rooms                 float64    0
floor                 float64    1030
floorCount            float64    171
buildYear             float64    2492
latitude              float64    0
longitude             float64    0
centreDistance        float64    0
poiCount              float64    0
schoolDistance        float64    2
clinicDistance        float64    5
postOfficeDistance    float64    5
kindergartenDistance  float64    7
restaurantDistance    float64    24
collegeDistance       float64    104
pharmacyDistance      float64    13
ownership             object     0
buildingMaterial      object     3459
condition             object     6223
hasParkingSpace       object     0
hasBalcony            object     0
hasElevator           object     454
hasSecurity           object     0
hasStorageRoom        object     0
price                 int64      0
As we can see, we have many missing values, e.g. column buildYear has 2492 missing values.
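Absolute counts are easier to judge when they are shown as a share of all rows. A minimal sketch, assuming a tiny made-up DataFrame named sample (the values are not from the data set): isna().mean() gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values are invented for illustration only
sample = pd.DataFrame({
    "buildYear": [1998.0, np.nan, 2010.0, np.nan],
    "rooms": [2.0, 3.0, 1.0, 2.0],
    "condition": ["premium", None, None, None],
})

# Fraction of missing values per column, worst column first
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column where most values are missing, like condition in this sample, may deserve a different treatment than one with only a handful of gaps.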
When we try to create a Machine Learning Model based on the DataFrame …
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Set the random seed before splitting so the split is reproducible
np.random.seed(42)
X = data_frame.drop("price", axis=1)
y = data_frame["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier()
model.fit(X_train, y_train)
… we will get an exception.
ValueError                                Traceback (most recent call last)
<ipython-input-15-345ee3a9038d> in <cell line: 17>()
     15 model = RandomForestClassifier()
     16 # 'fit()' - Build a forest of trees from the training set (X, y).
---> 17 model.fit(X_train, y_train)
     18 # 'predict()' - Predict class for X.
     19 y_preds = model.predict(X_test)
/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in __array__(self, dtype)
   1996     def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
   1997         values = self._values
-> 1998         arr = np.asarray(values, dtype=dtype)
   1999         if (
   2000             astype_is_view(values.dtype, arr.dtype)
ValueError: could not convert string to float: '1e1ec12d582075085f740f5c7bdf4091'
Before we create a Machine Learning Model we need to fill in missing values, even in numeric columns.
Numeric values
Filling missing values in numeric columns with the mean() value isn't the best idea, but as a starting point, it's good enough.
For this task, I've used two methods, fillna() and mean(), on specific columns, e.g. column floor, data_frame["floor"], and used the parameter inplace=True to avoid reassigning the value to the column.
# Dealing with missing values
# Filling NaN values
data_frame["floor"].fillna(data_frame["floor"].mean(), inplace=True)
data_frame["floorCount"].fillna(data_frame["floorCount"].mean(), inplace=True)
data_frame["buildYear"].fillna(data_frame["buildYear"].mean(), inplace=True)

# Without parameter inplace=True (recent pandas versions warn about calling
# fillna(..., inplace=True) on a selected column, so this form is safer)
# data_frame["buildYear"] = data_frame["buildYear"].fillna(data_frame["buildYear"].mean())
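One common alternative to the mean is the median, which is less affected by extreme values. This is only a sketch of that swapped-in technique, not what the article's code does; the DataFrame named sample and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one very old building and one missing value
sample = pd.DataFrame({"buildYear": [1900.0, 2000.0, 2010.0, 2020.0, np.nan]})

mean_year = sample["buildYear"].mean()      # pulled down by the 1900 outlier
median_year = sample["buildYear"].median()  # middle of the observed values

# Fill the gap with the median and reassign the column
filled = sample["buildYear"].fillna(median_year)

print("mean:", mean_year, "median:", median_year)
print(filled.tolist())
```

Whether the mean or the median is the better default depends on how skewed the column is, which is something to check during EDA.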
Non-numeric values
When we deal with non-numeric values, the worst thing we can do is fill in all missing values with the same value. What do I mean by that?
First, I check the unique values for a specific column.
# Checking unique values of a non-numeric column to fill NaN
print(f"Condition: {data_frame['condition'].unique()}")
Condition: ['premium' 'low']
We don't want all our apartments to be only premium or only low. Filling missing values with a single value is a very bad idea.
That's why I use the code below to find the unique values of a specific column and then randomly assign one of them to each missing cell.
unique_conditions = data_frame["condition"].dropna().unique()
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x)
The same can be applied to other columns, e.g. city, using its own values.
Cities: ['szczecin' 'gdynia' 'krakow' 'poznan' 'bialystok' 'gdansk' 'wroclaw' 'radom' 'rzeszow' 'lodz' 'katowice' 'lublin' 'czestochowa' 'warszawa' 'bydgoszcz']
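Picking uniformly from the unique values gives every category the same chance, even if one of them is rare in the data. A possible variation, shown here only as a sketch and not as the method used in this article, is to draw the replacement values with the frequencies observed in the column. The DataFrame named sample and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example gives the same result on every run
rng = np.random.default_rng(42)

# Hypothetical sample: 'low' is three times more common than 'premium'
sample = pd.DataFrame({
    "condition": ["premium", "low", "low", "low", None, None, None, None],
})

# Observed share of each category, missing values are ignored
frequencies = sample["condition"].value_counts(normalize=True)

# Draw one replacement per missing cell, weighted by the observed shares
missing_mask = sample["condition"].isna()
sample.loc[missing_mask, "condition"] = rng.choice(
    frequencies.index.to_numpy(),
    size=missing_mask.sum(),
    p=frequencies.to_numpy(),
)

print(sample["condition"].value_counts())
```

This keeps the filled column closer to the original distribution, at the cost of reinforcing whatever imbalance the observed data already has.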
Now that we have filled in all missing values, we can start converting the data into numbers, because ALL the data for the Machine Learning Model must be numerical.
Sometimes it's easy to change text into numbers. In the data set we use, the id column contains text such as 2a1a6db97ff122d6bc148abb6f0e498a; in this case, we can read it as a hexadecimal number and convert it into an integer. The same goes for boolean values like yes/no: we can convert them into 1 and 0.
# Convert non-numeric data into numeric
# id column type 'str' into 'int'
data_frame["id"] = data_frame["id"].apply(
    lambda x: int(x, 16) if isinstance(x, str) else x)
# columns with 'str' yes/no into 1/0
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map({'yes': 1, "no": 0})
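One detail worth knowing about map() with a dictionary: any value that is not a key in the dictionary becomes NaN, so it pays to count missing values again after the conversion. A minimal sketch on a made-up DataFrame named sample (the value 'maybe' is invented to trigger the effect):

```python
import pandas as pd

# Hypothetical sample; 'maybe' is an unexpected value added on purpose
sample = pd.DataFrame({
    "id": ["ff", "1a", "2a1a6db97ff122d6bc148abb6f0e498a"],
    "hasParkingSpace": ["yes", "no", "maybe"],
})

# Hexadecimal text into a Python integer
sample["id"] = sample["id"].apply(
    lambda x: int(x, 16) if isinstance(x, str) else x)

# 'yes'/'no' into 1/0; anything else is turned into NaN by map()
sample["hasParkingSpace"] = sample["hasParkingSpace"].map({"yes": 1, "no": 0})

# Count values that were not covered by the mapping
unmapped = sample["hasParkingSpace"].isna().sum()
print("Unmapped values:", unmapped)
```

If the count is greater than zero, the mapping dictionary is incomplete or the column contains typos that need cleaning first.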
Dates are often stored in text form too, e.g. 2024-06-10. We need to split each part, the year, the month, and the day, into separate variables/columns. The column used below is in a different data set, city_rentals_wro_2007_2023.csv, from the same "Apartment Prices in Poland": https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.
# Convert non-numeric data into numeric
# changing column 'date_listed' of type 'str' into separate numbers
data_frame['date_listed'] = pd.to_datetime(data_frame['date_listed'])

# create new columns for year, month, and day
data_frame['year'] = data_frame['date_listed'].dt.year
data_frame['month'] = data_frame['date_listed'].dt.month
data_frame['day'] = data_frame['date_listed'].dt.day
# drop the original 'date_listed' column if you wish
data_frame = data_frame.drop('date_listed', axis=1)
IMG
The table above shows the columns added after converting the column date_listed.
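Real date columns sometimes contain text that is not a valid date. The sketch below assumes a made-up DataFrame named sample and uses errors="coerce" in pd.to_datetime, which turns unparseable text into a missing value (NaT) instead of raising an exception.

```python
import pandas as pd

# Hypothetical sample; the last value is deliberately not a date
sample = pd.DataFrame({
    "date_listed": ["2024-06-10", "2023-12-01", "not a date"],
})

# Parse the text; invalid entries become NaT instead of raising an error
sample["date_listed"] = pd.to_datetime(
    sample["date_listed"], format="%Y-%m-%d", errors="coerce")

# Split the date into separate numeric columns
sample["year"] = sample["date_listed"].dt.year
sample["month"] = sample["date_listed"].dt.month
sample["day"] = sample["date_listed"].dt.day
sample = sample.drop("date_listed", axis=1)

print(sample)
```

The row with the invalid date ends up with NaN in the new columns, so it has to go through the same missing-value handling as any other numeric column.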
Text data can be turned into categories and then into numbers; the code below does it very well. I'm not going into many details. I use existing classes, the OneHotEncoder and the ColumnTransformer, both available in scikit-learn.
How did I decide which column may be treated as a category? It's related to the process described earlier in Filling missing values: Non-numeric values, and it's a part of the Exploratory Data Analysis (EDA).
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot,
                                  categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
Before transforming the DataFrame we had 28 columns; now we have 44 columns without human-readable names. Instead, the columns are labelled only with numbers.
ALL the data is numerical. We finished EDA and ended up with a DataFrame ready to be used in the Machine Learning Model.
Below we can find all the source code necessary for preparing the data for use with a Machine Learning Model.
Steps covered:
1. Loading the data
2. Dealing with missing values
   1. Identify missing values
   2. Filling missing values
      ◦ Numeric values
      ◦ Non-numeric values
3. Convert non-numeric data into numeric
   1. Text into numbers
   2. Dates into numbers
   3. Categories into numbers
# Importing the tools
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The file loaded in the step "Loading the data"
csv_file_name = "apartments_rent_pl_2024_01.csv"
data_frame = pd.read_csv(csv_file_name)

# Dealing with missing values
# Filling NaN values in numeric columns with the column mean
numeric_columns_with_nan = [
    "floor", "floorCount", "buildYear", "schoolDistance", "clinicDistance",
    "postOfficeDistance", "kindergartenDistance", "restaurantDistance",
    "collegeDistance", "pharmacyDistance",
]
for column in numeric_columns_with_nan:
    data_frame[column] = data_frame[column].fillna(data_frame[column].mean())

# Filling NaN values in non-numeric columns
data_frame["ownership"] = data_frame["ownership"].fillna("condominium")
# Each missing cell gets a randomly chosen existing value from the same column
for column in ["type", "buildingMaterial", "condition", "hasElevator"]:
    unique_values = data_frame[column].dropna().unique()
    data_frame[column] = data_frame[column].apply(
        lambda x: np.random.choice(unique_values) if pd.isna(x) else x)

# Convert non-numeric data into numeric
# id column type 'str' into 'int'
data_frame["id"] = data_frame["id"].apply(lambda x: int(x, 16) if isinstance(x, str) else x)
# columns with 'str' yes/no into 1/0
for column in ["hasParkingSpace", "hasBalcony", "hasElevator", "hasSecurity", "hasStorageRoom"]:
    data_frame[column] = data_frame[column].map({"yes": 1, "no": 0})

# X - training input samples, features
X = data_frame.drop("price", axis=1)

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot,
                                  categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
transformed_df.to_csv("saved_transformed_df.csv")

# y - training labels, the desired result, the target value
y = data_frame["price"]
Below we can find all the source code necessary for creating a Machine Learning Model based on the prepared data.
# Import 'train_test_split()' function
# "Split arrays or matrices into random train and test subsets."
from sklearn.model_selection import train_test_split
# Import the LinearRegression estimator class
from sklearn.linear_model import LinearRegression

# Setup random seed before splitting - to have the same results, me and you
np.random.seed(42)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
# Instantiate LinearRegression to create a Machine Learning Model
model = LinearRegression()
# 'fit()' - Fit the linear model to the training set (X, y).
model.fit(X_train, y_train)
# 'predict()' - Predict prices for the test set.
y_preds = model.predict(X_test)
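The code above stops at the predictions. A natural next step is to compare them with the real values from the test set. The snippet below is only a sketch on synthetic data (the arrays square_meters and price are generated, not taken from the data set); it uses the random_state parameter of train_test_split for a reproducible split and two standard scikit-learn measures, score() (R²) and mean_absolute_error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: price grows linearly with area, plus random noise
rng = np.random.default_rng(42)
square_meters = rng.uniform(20, 120, size=200).reshape(-1, 1)
price = 50 * square_meters.ravel() + 500 + rng.normal(0, 100, size=200)

# random_state makes the split itself reproducible
X_train, X_test, y_train, y_test = train_test_split(
    square_meters, price, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

# R^2: 1.0 is a perfect fit, 0.0 is no better than predicting the mean
r2 = model.score(X_test, y_test)
# Average absolute difference between predicted and real prices
mae = mean_absolute_error(y_test, y_preds)

print("R^2:", r2)
print("MAE:", mae)
```

On the real apartment data the numbers will be very different; the point is only to show where the evaluation fits into the workflow.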
NOTE: In this article, I'm just barely scratching the surface. This topic needs more learning and research on your own. I'm still at the beginning of my learning process of AI & ML!
Image generated with Midjourney, edited in GIMP. Screenshots made by the author.