Dealing with missing values is an important step in preparing data for machine learning. This tutorial provides examples of how to handle missing values using Python, focusing on the Pandas library. We'll import the necessary libraries, read the data, and explore various methods for handling missing values.
You can check the full code in the Jupyter Notebook.
We begin by importing the necessary libraries for our data manipulation and analysis tasks.
import numpy as np
import pandas as pd
- NumPy: A fundamental package for scientific computing in Python. It provides support for arrays, matrices, and numerous mathematical functions.
- Pandas: A powerful data manipulation and analysis library that offers the data structures and functions needed to work with structured data seamlessly.
We read the CSV file containing the NFL play-by-play data.
data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")
You can download the dataset from the Kaggle website.
During the import, a warning indicates that some columns have mixed data types. This can be addressed by specifying the dtype option or setting low_memory=False.
Output:
/tmp/ipykernel_23803/1150844578.py:1: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")
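Either remedy silences the warning. A minimal sketch of both follows, assuming str is an acceptable dtype for the affected columns (inspect their contents before committing to one):

# Option 1: read the file in one pass so pandas infers a single dtype
# per column, at the cost of higher memory usage.
data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv", low_memory=False)

# Option 2: pin the offending columns to an explicit dtype. The warning
# reports them by position (25 and 51), so look up their names first.
mixed_cols = data.columns[[25, 51]]
data = pd.read_csv(
    "./NFLPlayByPlay2009-2017_v4.csv",
    dtype={col: str for col in mixed_cols},  # str is an assumption
)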
To get an overview of the data's structure, we inspect the first few rows of the dataframe.
data.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 NaN 15:00 15 3600 0.0 TEN ... NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009
We calculate the number of missing values in each column.
missing_values_per_column = data.isnull().sum()
missing_values_per_column[0:10] # look at the first ten columns
Output:
Date 0
GameID 0
Drive 0
qtr 0
down 61154
time 224
TimeUnder 0
TimeSecs 224
PlayTimeDiff 444
SideofField 528
dtype: int64
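To spot the worst-affected columns rather than the first ten, the same series can be sorted; a small optional step:

missing_values_per_column.sort_values(ascending=False).head(10)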
To understand the proportion of missing data, we calculate the total number of cells and the percentage of missing values.
total_cells = np.prod(data.shape)
total_missing = missing_values_per_column.sum()
print('total_missing', total_missing)
print('total_cells', total_cells)
print('percent missing', (total_missing / total_cells) * 100)
Output:
total_missing 11505187
total_cells 41584176
percent missing 27.66722370547874
We can also count the non-missing values in each column.
data.count()
Output:
Date 407688
GameID 407688
Drive 407688
qtr 407688
down 346534
...
Win_Prob 382679
WPA 402147
airWPA 159187
yacWPA 158926
Season 407688
Length: 102, dtype: int64
While not recommended, one way to handle missing values is to remove the rows that contain them.
removed_rows_empty_data = data.dropna()
print(removed_rows_empty_data)
Output:
Empty DataFrame
Columns: [Date, GameID, Drive, qtr, down, time, TimeUnder, TimeSecs, PlayTimeDiff, SideofField, yrdln, yrdline100, ydstogo, ydsnet, GoalToGo, FirstDown, posteam, DefensiveTeam, desc, PlayAttempted, Yards.Gained, sp, Touchdown, ExPointResult, TwoPointConv, DefTwoPoint, Safety, Onsidekick, PuntResult, PlayType, Passer, Passer_ID, PassAttempt, PassOutcome, PassLength, AirYards, YardsAfterCatch, QBHit, PassLocation, InterceptionThrown, Interceptor, Rusher, Rusher_ID, RushAttempt, RunLocation, RunGap, Receiver, Receiver_ID, Reception, ReturnResult, Returner, BlockingPlayer, Tackler1, Tackler2, FieldGoalResult, FieldGoalDistance, Fumble, RecFumbTeam, RecFumbPlayer, Sack, Challenge.Replay, ChalReplayResult, Accepted.Penalty, PenalizedTeam, PenaltyType, PenalizedPlayer, Penalty.Yards, PosTeamScore, DefTeamScore, ScoreDiff, AbsScoreDiff, HomeTeam, AwayTeam, Timeout_Indicator, Timeout_Team, posteam_timeouts_pre, HomeTimeouts_Remaining_Pre, AwayTimeouts_Remaining_Pre, HomeTimeouts_Remaining_Post, AwayTimeouts_Remaining_Post, No_Score_Prob, Opp_Field_Goal_Prob, Opp_Safety_Prob, Opp_Touchdown_Prob, Field_Goal_Prob, Safety_Prob, Touchdown_Prob, ExPoint_Prob, TwoPoint_Prob, ExpPts, EPA, airEPA, yacEPA, Home_WP_pre, Away_WP_pre, Home_WP_post, Away_WP_post, Win_Prob, WPA, airWPA, yacWPA, ...]
Index: []
[0 rows x 102 columns]
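Dropping every row with any missing value empties this dataset entirely. If row removal is still wanted, dropna can be scoped more narrowly; a sketch in which the chosen column and threshold are illustrative assumptions:

# Drop a row only when a specific key column ('down' here) is missing.
rows_with_down = data.dropna(subset=['down'])
print(rows_with_down.shape)

# Or keep any row that has at least 80 non-missing cells out of 102.
mostly_complete_rows = data.dropna(thresh=80)
print(mostly_complete_rows.shape)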
A more common approach is to remove the columns that contain missing values.
removed_columns_empty_data = data.dropna(axis=1)
print(removed_columns_empty_data)
Output:
Date GameID Drive qtr TimeUnder ydstogo ydsnet PlayAttempted Yards.Gained sp ... AwayTeam Timeout_Indicator posteam_timeouts_pre HomeTimeouts_Remaining_Pre AwayTimeouts_Remaining_Pre HomeTimeouts_Remaining_Post AwayTimeouts_Remaining_Post ExPoint_Prob TwoPoint_Prob Season
0 2009-09-10 2009091000 1 1 15 0 0 1 39 0 ... TEN 0 3 3 3 3 3 0.0 0.0 2009
1 2009-09-10 2009091000 1 1 15 10 5
We then measure the impact of this operation by comparing the number of columns before and after.
print("original columns: %d\n" % data.shape[1])
print("cleaned columns: %d\n" % removed_columns_empty_data.shape[1])
Output:
original columns: 102
cleaned columns: 37
To work with a smaller portion of the dataset, we can create a subset.
subset_nfl_data = data.loc[:, 'EPA':'Season'].head()
subset_nfl_data
Output:
EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2.014474 NaN NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 -1.402760 NaN NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2.097796 NaN NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009
A straightforward method for handling missing values is to fill them with a specific value, such as zero.
filled_basic_data = data.fillna(0)
filled_basic_data.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 0.0 15:00 15 3600 0.0 TEN ... 0.000000 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 0.000000 0.000000 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... 0.000000 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 0.000000 0.000000 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... 0.000000 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 0.000000 0.000000 2009
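A single constant rarely suits every column. fillna also accepts a dictionary mapping column names to fill values, so each column can get its own default; the columns and values below are illustrative assumptions:

filled_per_column = data.fillna({
    'down': 0,                 # plays with no down, e.g. kickoffs
    'PlayTimeDiff': 0,         # numeric column: zero as a neutral default
    'SideofField': 'UNKNOWN',  # text column: use a sentinel string
})
filled_per_column.head()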
Another approach is to fill missing values based on the next valid observation in the same column.
column_based_fill = data.bfill(axis=0).fillna(0)
column_based_fill.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 1.0 15:00 15 3600 0.0 TEN ... 1.146076 0.485675 0.514325 0.546433
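For comparison, ffill is the forward-direction counterpart: it propagates the previous valid observation down the column, and the trailing fillna(0) again covers anything still missing, such as NaNs at the very top of a column.

column_based_ffill = data.ffill(axis=0).fillna(0)
column_based_ffill.head()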
These steps provide a comprehensive guide to identifying and handling missing values in a dataset, ensuring the data is ready for analysis and modeling. Each method has its pros and cons, and the choice of method depends on the specific context and requirements of your analysis.
This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.