Dealing with missing values is an important step in preparing data for machine learning. This tutorial provides examples of how to handle missing values using Python, focusing on the Pandas library. We'll import the necessary libraries, read the data, and explore various methods for handling missing values.
You can check the full code in the Jupyter Notebook.
We begin by importing the necessary libraries for our data manipulation and analysis tasks.
import numpy as np
import pandas as pd
- NumPy: A fundamental package for scientific computing in Python. It provides support for arrays, matrices, and numerous mathematical functions.
- Pandas: A powerful data manipulation and analysis library that offers the data structures and functions needed to work with structured data seamlessly.
We read the CSV file containing the NFL play-by-play data.
data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")
You can download the dataset from the Kaggle website.
During the import, a warning indicates that some columns have mixed data types. This can be addressed by specifying the dtype option or setting low_memory=False.
Output:
/tmp/ipykernel_23803/1150844578.py:1: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")
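Either remedy silences the warning. A minimal sketch of both follows, assuming str is an acceptable dtype for the affected columns (inspect their contents before committing to one):

# Option 1: read the file in one pass so pandas infers a single dtype
# per column, at the cost of higher memory usage.
data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv", low_memory=False)

# Option 2: pin the offending columns to an explicit dtype. The warning
# reports them by position (25 and 51), so look up their names first.
mixed_cols = data.columns[[25, 51]]
data = pd.read_csv(
    "./NFLPlayByPlay2009-2017_v4.csv",
    dtype={col: str for col in mixed_cols},  # str is an assumption
)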
To get an overview of the data's structure, we inspect the first few rows of the dataframe.
data.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 NaN 15:00 15 3600 0.0 TEN ... NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009
We calculate the number of missing values in each column.
missing_values_per_column = data.isnull().sum()
missing_values_per_column[0:10] # look at the first ten columns
Output:
Date 0
GameID 0
Drive 0
qtr 0
down 61154
time 224
TimeUnder 0
TimeSecs 224
PlayTimeDiff 444
SideofField 528
dtype: int64
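To spot the worst-affected columns rather than the first ten, the same series can be sorted; a small optional step:

missing_values_per_column.sort_values(ascending=False).head(10)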
To understand the proportion of missing data, we calculate the total number of cells and the percentage of missing values.
total_cells = np.prod(data.shape)
total_missing = missing_values_per_column.sum()
print('total_missing', total_missing)
print('total_cells', total_cells)
print('percent missing', (total_missing / total_cells) * 100)
Output:
total_missing 11505187
total_cells 41584176
percent missing 27.66722370547874
We can also count the non-missing values in each column.
data.count()
Output:
Date 407688
GameID 407688
Drive 407688
qtr 407688
down 346534
...
Win_Prob 382679
WPA 402147
airWPA 159187
yacWPA 158926
Season 407688
Length: 102, dtype: int64
While not recommended, one way to handle missing values is to remove the rows that contain them.
removed_rows_empty_data = data.dropna()
print(removed_rows_empty_data)
Output:
Empty DataFrame
Columns: [Date, GameID, Drive, qtr, down, time, TimeUnder, TimeSecs, PlayTimeDiff, SideofField, yrdln, yrdline100, ydstogo, ydsnet, GoalToGo, FirstDown, posteam, DefensiveTeam, desc, PlayAttempted, Yards.Gained, sp, Touchdown, ExPointResult, TwoPointConv, DefTwoPoint, Safety, Onsidekick, PuntResult, PlayType, Passer, Passer_ID, PassAttempt, PassOutcome, PassLength, AirYards, YardsAfterCatch, QBHit, PassLocation, InterceptionThrown, Interceptor, Rusher, Rusher_ID, RushAttempt, RunLocation, RunGap, Receiver, Receiver_ID, Reception, ReturnResult, Returner, BlockingPlayer, Tackler1, Tackler2, FieldGoalResult, FieldGoalDistance, Fumble, RecFumbTeam, RecFumbPlayer, Sack, Challenge.Replay, ChalReplayResult, Accepted.Penalty, PenalizedTeam, PenaltyType, PenalizedPlayer, Penalty.Yards, PosTeamScore, DefTeamScore, ScoreDiff, AbsScoreDiff, HomeTeam, AwayTeam, Timeout_Indicator, Timeout_Team, posteam_timeouts_pre, HomeTimeouts_Remaining_Pre, AwayTimeouts_Remaining_Pre, HomeTimeouts_Remaining_Post, AwayTimeouts_Remaining_Post, No_Score_Prob, Opp_Field_Goal_Prob, Opp_Safety_Prob, Opp_Touchdown_Prob, Field_Goal_Prob, Safety_Prob, Touchdown_Prob, ExPoint_Prob, TwoPoint_Prob, ExpPts, EPA, airEPA, yacEPA, Home_WP_pre, Away_WP_pre, Home_WP_post, Away_WP_post, Win_Prob, WPA, airWPA, yacWPA, ...]
Index: []
[0 rows x 102 columns]
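Dropping every row with any missing value empties this dataset entirely. If row removal is still wanted, dropna can be scoped more narrowly; a sketch in which the chosen column and threshold are illustrative assumptions:

# Drop a row only when a specific key column ('down' here) is missing.
rows_with_down = data.dropna(subset=['down'])
print(rows_with_down.shape)

# Or keep any row that has at least 80 non-missing cells out of 102.
mostly_complete_rows = data.dropna(thresh=80)
print(mostly_complete_rows.shape)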
A more common approach is to remove the columns that contain missing values.
removed_columns_empty_data = data.dropna(axis=1)
print(removed_columns_empty_data)
Output:
Date GameID Drive qtr TimeUnder ydstogo ydsnet PlayAttempted Yards.Gained sp ... AwayTeam Timeout_Indicator posteam_timeouts_pre HomeTimeouts_Remaining_Pre AwayTimeouts_Remaining_Pre HomeTimeouts_Remaining_Post AwayTimeouts_Remaining_Post ExPoint_Prob TwoPoint_Prob Season
0 2009-09-10 2009091000 1 1 15 0 0 1 39 0 ... TEN 0 3 3 3 3 3 0.0 0.0 2009
1 2009-09-10 2009091000 1 1 15 10 5
We then measure the impact of this operation by comparing the number of columns before and after.
print("original columns: %d\n" % data.shape[1])
print("cleaned columns: %d\n" % removed_columns_empty_data.shape[1])
Output:
original columns: 102
cleaned columns: 37
To work with a smaller portion of the dataset, we can create a subset.
subset_nfl_data = data.loc[:, 'EPA':'Season'].head()
subset_nfl_data
Output:
EPA airEPA yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2.014474 NaN NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 -1.402760 NaN NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2.097796 NaN NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009
A straightforward method for handling missing values is to fill them with a specific value, such as zero.
filled_basic_data = data.fillna(0)
filled_basic_data.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 0.0 15:00 15 3600 0.0 TEN ... 0.000000 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 0.000000 0.000000 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... 0.000000 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 0.000000 0.000000 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... 0.000000 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 0.000000 0.000000 2009
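A single constant rarely suits every column. fillna also accepts a dictionary mapping column names to fill values, so each column can get its own default; the columns and values below are illustrative assumptions:

filled_per_column = data.fillna({
    'down': 0,                 # plays with no down, e.g. kickoffs
    'PlayTimeDiff': 0,         # numeric column: zero as a neutral default
    'SideofField': 'UNKNOWN',  # text column: use a sentinel string
})
filled_per_column.head()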
Another approach is to fill missing values based on the next valid observation in the same column.
column_based_fill = data.bfill(axis=0).fillna(0)
column_based_fill.head()
Output:
Date GameID Drive qtr down time TimeUnder TimeSecs PlayTimeDiff SideofField ... yacEPA Home_WP_pre Away_WP_pre Home_WP_post Away_WP_post Win_Prob WPA airWPA yacWPA Season
0 2009-09-10 2009091000 1 1 1.0 15:00 15 3600 0.0 TEN ... 1.146076 0.485675 0.514325 0.546433
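For comparison, ffill is the forward-direction counterpart: it propagates the previous valid observation down the column, and the trailing fillna(0) again covers anything still missing, such as NaNs at the very top of a column.

column_based_ffill = data.ffill(axis=0).fillna(0)
column_based_ffill.head()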
These steps provide a comprehensive guide to identifying and handling missing values in a dataset, ensuring the data is ready for analysis and modeling. Each method has its pros and cons, and the choice of method depends on the specific context and requirements of your analysis.
This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.