Dealing with missing values is a crucial step in preparing data for machine learning. This tutorial provides examples of how to handle missing values using Python, focusing on the Pandas library. We'll import the necessary libraries, read the data, and explore several methods for dealing with missing values.
You can find the complete code in the Jupyter Notebook
We begin by importing the necessary libraries for our data manipulation and analysis tasks.
import numpy as np
import pandas as pd
- NumPy: A fundamental package for scientific computing in Python. It provides support for arrays, matrices, and a wide range of mathematical functions.
- Pandas: A powerful data manipulation and analysis library that provides the data structures and functions needed to work with structured data seamlessly.
We read the CSV file containing the NFL play-by-play data.
data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")
You can download the dataset from the Kaggle website
During the import, a warning indicates that some columns have mixed data types. This can be addressed by specifying the dtype option or setting low_memory=False.
Output:
/tmp/ipykernel_23803/1150844578.py:1: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
  data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")
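If the warning matters for your workflow, either remedy the message suggests can be passed straight to read_csv. Here is a minimal sketch; the commented dtype mapping is a hypothetical example, since we have not inspected columns 25 and 51 here.
data = pd.read_csv(
    "./NFLPlayByPlay2009-2017_v4.csv",
    low_memory=False,  # read the file in one pass so each column's type is inferred once
    # dtype={"TimeSecs": "float64"},  # or declare the problematic columns explicitly
)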
To get an overview of the data's structure, we look at the first few rows of the DataFrame.
data.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 NaN 15:00 15 3600 0.0 TEN ... NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009
We calculate the number of missing values in each column.
missing_values_per_column = data.isnull().sum()
missing_values_per_column[0:10]  # looking at the first ten columns
Output:
Date 0
GameID 0
Drive 0
qtr 0
down 61154
time 224
TimeUnder 0
TimeSecs 224
PlayTimeDiff 444
SideofField 528
dtype: int64
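As a small extension (not part of the original steps), expressing missingness as a per-column fraction makes the worst offenders easy to rank.
missing_ratio = data.isnull().mean().sort_values(ascending=False)  # fraction missing per column
print(missing_ratio.head(10))  # the ten most-missing columns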
To understand the extent of the missing data, we calculate the total number of cells and the percentage that are missing.
total_cells = np.prod(data.shape)
total_missing = missing_values_per_column.sum()
print('total_missing', total_missing)
print('total_cells', total_cells)
print('proportion missing', (total_missing / total_cells) * 100)
Output:
total_missing 11505187
total_cells 41584176
proportion missing 27.66722370547874
We can also count the non-missing values in each column.
data.count()
Output:
Date 407688
GameID 407688
Drive 407688
qtr 407688
down 346534
...
Win_Prob 382679
WPA 402147
airWPA 159187
yacWPA 158926
Season 407688
Length: 102, dtype: int64
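As a sanity check of our own, the missing and non-missing counts should cover every cell exactly once.
# non-missing cells plus missing cells must equal the total cell count
assert data.count().sum() + data.isnull().sum().sum() == np.prod(data.shape)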
While rarely advisable, one way to handle missing values is to remove every row that contains one.
removed_rows_empty_data = data.dropna()
print(removed_rows_empty_data)
Output:
Empty DataFrame
Columns: [Date, GameID, Drive, qtr, down, time, TimeUnder, TimeSecs, PlayTimeDiff, SideofField, yrdln, yrdline100, ydstogo, ydsnet, GoalToGo, FirstDown, posteam, DefensiveTeam, desc, PlayAttempted, Yards.Gained, sp, Touchdown, ExPointResult, TwoPointConv, DefTwoPoint, Safety, Onsidekick, PuntResult, PlayType, Passer, Passer_ID, PassAttempt, PassOutcome, PassLength, AirYards, YardsAfterCatch, QBHit, PassLocation, InterceptionThrown, Interceptor, Rusher, Rusher_ID, RushAttempt, RunLocation, RunGap, Receiver, Receiver_ID, Reception, ReturnResult, Returner, BlockingPlayer, Tackler1, Tackler2, FieldGoalResult, FieldGoalDistance, Fumble, RecFumbTeam, RecFumbPlayer, Sack, Challenge.Replay, ChalReplayResult, Accepted.Penalty, PenalizedTeam, PenaltyType, PenalizedPlayer, Penalty.Yards, PosTeamScore, DefTeamScore, ScoreDiff, AbsScoreDiff, HomeTeam, AwayTeam, Timeout_Indicator, Timeout_Team, posteam_timeouts_pre, HomeTimeouts_Remaining_Pre, AwayTimeouts_Remaining_Pre, HomeTimeouts_Remaining_Post, AwayTimeouts_Remaining_Post, No_Score_Prob, Opp_Field_Goal_Prob, Opp_Safety_Prob, Opp_Touchdown_Prob, Field_Goal_Prob, Safety_Prob, Touchdown_Prob, ExPoint_Prob, TwoPoint_Prob, ExpPts, EPA, airEPA, yacEPA, Home_WP_pre, Away_WP_pre, Home_WP_post, Away_WP_post, Win_Prob, WPA, airWPA, yacWPA, ...]
Index: []
[0 rows x 102 columns]
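Because every row contains at least one missing value, dropna() leaves nothing behind. A gentler variant, shown here as an illustration rather than part of the original steps, drops a row only when named key columns are missing, or requires a minimum number of non-null cells per row.
rows_with_down = data.dropna(subset=["down"])  # keep plays where 'down' is recorded
mostly_complete = data.dropna(thresh=90)  # keep rows with at least 90 non-null cells
print(len(rows_with_down), len(mostly_complete))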
A more common approach is to remove the columns that contain missing values.
removed_columns_empty_data = data.dropna(axis=1)
print(removed_columns_empty_data)
Output:
Date GameID Drive qtr TimeUnder ydstogo ydsnet PlayAttempted Yards.Gained sp ... AwayTeam Timeout_Indicator posteam_timeouts_pre HomeTimeouts_Remaining_Pre AwayTimeouts_Remaining_Pre HomeTimeouts_Remaining_Post AwayTimeouts_Remaining_Post ExPoint_Prob TwoPoint_Prob Season
0 2009-09-10 2009091000 1 1 15 0 0 1 39 0 ... TEN 0 3 3 3 3 3 0.0 0.0 2009
1 2009-09-10 2009091000 1 1 15 10 5
We then assess the impact of this operation by comparing the number of columns before and after.
print("original columns: %d \n" % data.shape[1])
print("cleaned columns: %d \n" % removed_columns_empty_data.shape[1])
Output:
original columns: 102
cleaned columns: 37
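So 65 of the 102 columns (102 - 37) were dropped. To see exactly which ones, a quick set difference of the column names works:
dropped_columns = set(data.columns) - set(removed_columns_empty_data.columns)
print(sorted(dropped_columns)[:10])  # a sample of the dropped column names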
To work with a smaller portion of the dataset, we can create a subset.
subset_nfl_data = data.loc[:, 'EPA':'Season'].head()
subset_nfl_data
Output:
EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2.014474 NaN NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 -1.402760 NaN NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2.097796 NaN NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009
A simple method for handling missing values is to fill them with a specific value, such as zero.
filled_basic_data = data.fillna(0)
filled_basic_data.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 0.0 15:00 15 3600 0.0 TEN ... 0.000000 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 0.000000 0.000000 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... 0.000000 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 0.000000 0.000000 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... 0.000000 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 0.000000 0.000000 2009
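Zero is not always a sensible placeholder: a missing 'down', for instance, does not mean down zero. As a hedged alternative sketch, we can fill only the numeric columns with their column means, leaving everything else untouched.
numeric_cols = data.select_dtypes(include="number").columns
filled_mean_data = data.copy()
# fillna with a Series of column means aligns on column names
filled_mean_data[numeric_cols] = filled_mean_data[numeric_cols].fillna(
    filled_mean_data[numeric_cols].mean()
)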
Another technique is to fill each missing value with the next valid observation in the same column (a backward fill).
column_based_fill = data.bfill(axis=0).fillna(0)
column_based_fill.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 1.0 15:00 15 3600 0.0 TEN ... 1.146076 0.485675 0.514325 0.546433
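The complementary direction, included here as an illustration, carries the last valid observation forward instead, then zero-fills whatever is still missing at the start of each column.
forward_filled = data.ffill(axis=0).fillna(0)
forward_filled.head()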
These steps provide a complete guide to identifying and handling missing values in a dataset, ensuring the data is ready for analysis and modeling. Each method has its pros and cons, and the right choice depends on the specific context and requirements of your analysis.
This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.