Dealing with missing values is a crucial step in preparing data for machine learning. This tutorial provides examples of how to handle missing values using Python, focusing on the Pandas library. We'll import the necessary libraries, read the data, and explore several methods for dealing with missing values.
You can find the complete code in the Jupyter Notebook
We begin by importing the necessary libraries for our data manipulation and analysis tasks.
import numpy as np
import pandas as pd
- NumPy: A fundamental package for scientific computing in Python. It provides support for arrays, matrices, and a wide range of mathematical functions.
- Pandas: A powerful data manipulation and analysis library that provides the data structures and functions needed to work with structured data seamlessly.
We read the CSV file containing the NFL play-by-play data.
data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")
You can download the dataset from the Kaggle website
During the import, a warning indicates that some columns have mixed data types. This can be addressed by specifying the dtype option or setting low_memory=False.
Output:
/tmp/ipykernel_23803/1150844578.py:1: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
  data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")
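If the warning matters for your workflow, either remedy the message suggests can be passed straight to read_csv. Here is a minimal sketch; the commented dtype mapping is a hypothetical example, since we have not inspected columns 25 and 51 here.
data = pd.read_csv(
    "./NFLPlayByPlay2009-2017_v4.csv",
    low_memory=False,  # read the file in one pass so each column's type is inferred once
    # dtype={"TimeSecs": "float64"},  # or declare the problematic columns explicitly
)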
To get an overview of the data's structure, we look at the first few rows of the DataFrame.
data.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 NaN 15:00 15 3600 0.0 TEN ... NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009
We calculate the number of missing values in each column.
missing_values_per_column = data.isnull().sum()
missing_values_per_column[0:10]  # looking at the first ten columns
Output:
Date 0
GameID 0
Drive 0
qtr 0
down 61154
time 224
TimeUnder 0
TimeSecs 224
PlayTimeDiff 444
SideofField 528
dtype: int64
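As a small extension (not part of the original steps), expressing missingness as a per-column fraction makes the worst offenders easy to rank.
missing_ratio = data.isnull().mean().sort_values(ascending=False)  # fraction missing per column
print(missing_ratio.head(10))  # the ten most-missing columns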
To understand the extent of the missing data, we calculate the total number of cells and the percentage that are missing.
total_cells = np.prod(data.shape)
total_missing = missing_values_per_column.sum()
print('total_missing', total_missing)
print('total_cells', total_cells)
print('proportion missing', (total_missing / total_cells) * 100)
Output:
total_missing 11505187
total_cells 41584176
proportion missing 27.66722370547874
We can also count the non-missing values in each column.
data.count()
Output:
Date 407688
GameID 407688
Drive 407688
qtr 407688
down 346534
...
Win_Prob 382679
WPA 402147
airWPA 159187
yacWPA 158926
Season 407688
Length: 102, dtype: int64
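As a sanity check of our own, the missing and non-missing counts should cover every cell exactly once.
# non-missing cells plus missing cells must equal the total cell count
assert data.count().sum() + data.isnull().sum().sum() == np.prod(data.shape)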
While rarely advisable, one way to handle missing values is to remove every row that contains one.
removed_rows_empty_data = data.dropna()
print(removed_rows_empty_data)
Output:
Empty DataFrame
Columns: [Date, GameID, Drive, qtr, down, time, TimeUnder, TimeSecs, PlayTimeDiff, SideofField, yrdln, yrdline100, ydstogo, ydsnet, GoalToGo, FirstDown, posteam, DefensiveTeam, desc, PlayAttempted, Yards.Gained, sp, Touchdown, ExPointResult, TwoPointConv, DefTwoPoint, Safety, Onsidekick, PuntResult, PlayType, Passer, Passer_ID, PassAttempt, PassOutcome, PassLength, AirYards, YardsAfterCatch, QBHit, PassLocation, InterceptionThrown, Interceptor, Rusher, Rusher_ID, RushAttempt, RunLocation, RunGap, Receiver, Receiver_ID, Reception, ReturnResult, Returner, BlockingPlayer, Tackler1, Tackler2, FieldGoalResult, FieldGoalDistance, Fumble, RecFumbTeam, RecFumbPlayer, Sack, Challenge.Replay, ChalReplayResult, Accepted.Penalty, PenalizedTeam, PenaltyType, PenalizedPlayer, Penalty.Yards, PosTeamScore, DefTeamScore, ScoreDiff, AbsScoreDiff, HomeTeam, AwayTeam, Timeout_Indicator, Timeout_Team, posteam_timeouts_pre, HomeTimeouts_Remaining_Pre, AwayTimeouts_Remaining_Pre, HomeTimeouts_Remaining_Post, AwayTimeouts_Remaining_Post, No_Score_Prob, Opp_Field_Goal_Prob, Opp_Safety_Prob, Opp_Touchdown_Prob, Field_Goal_Prob, Safety_Prob, Touchdown_Prob, ExPoint_Prob, TwoPoint_Prob, ExpPts, EPA, airEPA, yacEPA, Home_WP_pre, Away_WP_pre, Home_WP_post, Away_WP_post, Win_Prob, WPA, airWPA, yacWPA, ...]
Index: []
[0 rows x 102 columns]
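Because every row contains at least one missing value, dropna() leaves nothing behind. A gentler variant, shown here as an illustration rather than part of the original steps, drops a row only when named key columns are missing, or requires a minimum number of non-null cells per row.
rows_with_down = data.dropna(subset=["down"])  # keep plays where 'down' is recorded
mostly_complete = data.dropna(thresh=90)  # keep rows with at least 90 non-null cells
print(len(rows_with_down), len(mostly_complete))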
A more common approach is to remove the columns that contain missing values.
removed_columns_empty_data = data.dropna(axis=1)
print(removed_columns_empty_data)
Output:
Date GameID Drive qtr TimeUnder ydstogo ydsnet PlayAttempted Yards.Gained sp ... AwayTeam Timeout_Indicator posteam_timeouts_pre HomeTimeouts_Remaining_Pre AwayTimeouts_Remaining_Pre HomeTimeouts_Remaining_Post AwayTimeouts_Remaining_Post ExPoint_Prob TwoPoint_Prob Season
0 2009-09-10 2009091000 1 1 15 0 0 1 39 0 ... TEN 0 3 3 3 3 3 0.0 0.0 2009
1 2009-09-10 2009091000 1 1 15 10 5
We then assess the impact of this operation by comparing the number of columns before and after.
print("original columns: %d \n" % data.shape[1])
print("cleaned columns: %d \n" % removed_columns_empty_data.shape[1])
Output:
original columns: 102
cleaned columns: 37
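So 65 of the 102 columns (102 - 37) were dropped. To see exactly which ones, a quick set difference of the column names works:
dropped_columns = set(data.columns) - set(removed_columns_empty_data.columns)
print(sorted(dropped_columns)[:10])  # a sample of the dropped column names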
To work with a smaller portion of the dataset, we can create a subset.
subset_nfl_data = data.loc[:, 'EPA':'Season'].head()
subset_nfl_data
Output:
EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2.014474 NaN NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 -1.402760 NaN NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2.097796 NaN NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009
A simple method for handling missing values is to fill them with a specific value, such as zero.
filled_basic_data = data.fillna(0)
filled_basic_data.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 0.0 15:00 15 3600 0.0 TEN ... 0.000000 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 0.000000 0.000000 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... 0.000000 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 0.000000 0.000000 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... 0.000000 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 0.000000 0.000000 2009
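Zero is not always a sensible placeholder: a missing 'down', for instance, does not mean down zero. As a hedged alternative sketch, we can fill only the numeric columns with their column means, leaving everything else untouched.
numeric_cols = data.select_dtypes(include="number").columns
filled_mean_data = data.copy()
# fillna with a Series of column means aligns on column names
filled_mean_data[numeric_cols] = filled_mean_data[numeric_cols].fillna(
    filled_mean_data[numeric_cols].mean()
)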
Another technique is to fill each missing value with the next valid observation in the same column (a backward fill).
column_based_fill = data.bfill(axis=0).fillna(0)
column_based_fill.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 1.0 15:00 15 3600 0.0 TEN ... 1.146076 0.485675 0.514325 0.546433
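The complementary direction, included here as an illustration, carries the last valid observation forward instead, then zero-fills whatever is still missing at the start of each column.
forward_filled = data.ffill(axis=0).fillna(0)
forward_filled.head()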
These steps provide a complete guide to identifying and handling missing values in a dataset, ensuring the data is ready for analysis and modeling. Each method has its pros and cons, and the right choice depends on the specific context and requirements of your analysis.
This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.