Machine Learning: NBA Player Salary
Building a linear regression model to estimate NBA player salary based on in-game stats.
Recently, I harnessed the power of AI to accurately estimate the salaries of every NBA player. By telling the AI the in-game stats (from nba.com) and salaries (from hoopshype.com) of all players from the previous 4 seasons, the AI was able to tell me just how over/underpaid a player was this season.
This tool is great because it allows both casual fans and NBA analysts to quickly see which players produce the best value for money, whether that’s for keeping players within the salary cap, or drafting your next fantasy team.
Without further ado, let’s look at the process that went into building this machine learning model.
Feature Selection
Salary estimations were based on a player’s in-game stats. Initially, we considered using a combination of both offensive (e.g. PPG, AST) and defensive (e.g. BLK, STL) stats to estimate salary. All features are ‘desirable’, meaning that the higher the number, the better (more points per game, assists, FG%, rebounds are all desirable).
A stat such as turnovers was not selected as a feature, as more turnovers is not desirable. Also, given that assists and turnovers are often correlated (e.g. Luka Dončić is 2nd in the league in both assists and turnovers), the risk of the model ‘rewarding’ turnovers due to collinearity with assists was another reason why turnovers were not used.
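One way to check the collinearity concern is to measure the Pearson correlation between two candidate features before deciding whether to keep both. The sketch below is only an illustration: the `pearson` helper and the five rows of per-game numbers are made up for this example and are not taken from nba.com.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up per-game averages for five hypothetical players (not real NBA data).
assists = [9.8, 7.1, 5.4, 3.2, 1.5]
turnovers = [4.0, 3.1, 2.6, 1.4, 0.9]

r = pearson(assists, turnovers)
print(f"assists vs turnovers: r = {r:.2f}")
```

A value of r close to 1 would suggest the two stats carry much of the same information, which is the situation where a regression can end up assigning a misleading positive weight to the undesirable one.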
The extent to which each selected stat/feature affects player salary will be determined by the machine learning algorithm.
Data Sources
Salary estimations were based on a player’s in-game stats. nba.com houses up-to-date, reliable, and accurate stats for all players this season, and so all in-game data came from this source.
To know whether these estimations were accurate, we also needed to know the actual salaries for each player. nba.com did not have this information, and so we used hoopshype.com to get all player salaries.
To extract this information, we created a Java project and used a WebDriver to automatically select the relevant data through web scraping.
Data Storage
Once we were able to extract the data from both websites, we created a local MySQL database and tables to store this information.
To combine the data from nba.com and hoopshype.com into one table, player name was used as the common identifier. This process included standardising the name format from both sources. For example, all periods were removed from names like Jaren Jackson Jr. from nba.com to match the formatting of hoopshype.com (no period).
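The name-matching step can be sketched in a few lines. This is an in-memory Python illustration rather than the MySQL join used in the project; the `normalise_name` helper, the dictionary layout and the placeholder numbers are all assumptions made for the example.

```python
def normalise_name(name: str) -> str:
    """Strip periods and surplus whitespace so names from both sources match."""
    return " ".join(name.replace(".", "").split())


# Placeholder rows; the stat and salary values are illustrative, not real data.
stats_rows = {"Jaren Jackson Jr.": {"ppg": 20.0}}   # nba.com-style name
salary_rows = {"Jaren Jackson Jr": 25_000_000}      # hoopshype.com-style name

# Index salaries by the normalised name, then join the stats onto them.
salaries_by_key = {normalise_name(n): s for n, s in salary_rows.items()}

merged = {}
for name, stats in stats_rows.items():
    key = normalise_name(name)
    if key in salaries_by_key:
        merged[key] = {**stats, "salary": salaries_by_key[key]}

print(merged)
```

Rows whose names still fail to match after normalisation are simply left out here; in practice they would be worth logging and checking by hand.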
Training Data
When building a machine learning algorithm, it is common practice to split your data into 80% training data and 20% testing data. We want to use our salary estimation model on players for this season, so that was our testing data. Therefore, we used the stats from the previous 4 NBA seasons (2019–2023) as our training data to train the model. Player name and year served as a composite key to uniquely identify a row in the table.
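Splitting by season rather than at random can be sketched as follows. The row layout, the season labels and the single hypothetical player are assumptions for illustration only; with four earlier seasons and one current season, the split works out at roughly 80/20.

```python
# One hypothetical player across five seasons (values are placeholders).
rows = [
    {"player": "Player A", "season": "2019-20", "ppg": 10.0},
    {"player": "Player A", "season": "2020-21", "ppg": 12.0},
    {"player": "Player A", "season": "2021-22", "ppg": 15.0},
    {"player": "Player A", "season": "2022-23", "ppg": 18.0},
    {"player": "Player A", "season": "2023-24", "ppg": 21.0},
]

TEST_SEASON = "2023-24"

# Earlier seasons train the model; the current season is held out for testing.
train = [r for r in rows if r["season"] != TEST_SEASON]
test = [r for r in rows if r["season"] == TEST_SEASON]

# (player, season) acts as the composite key, so it must be unique per row.
keys = {(r["player"], r["season"]) for r in rows}
assert len(keys) == len(rows), "duplicate (player, season) rows found"

print(len(train), "training rows,", len(test), "testing rows")
```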
Data Transformation
As the salary cap increases, players tend to earn more and more money each season. Stats like average PPG and 3P% can also change from one season to the next, due to trends in how teams play basketball, e.g. spacing, pace, rule changes etc. For this reason, any stats from the previous seasons were normalised to a 2023–24 standard.
The yearly averages for each season were retrieved from basketball-reference.com to calculate the multiplier that would bring all stats to 2023–24 standard.
For example, in 2023–2024 the average PPG was 115.2. In 2019–2020, this value was 111.8. 115.2/111.8 = 1.03 (2 d.p.). That means that any PPG data from 2019–2020 would be multiplied by 1.03 to match what we would expect to occur in the current NBA season.
By using a relative multiplier on all data, the model will therefore be trained with data more closely aligned to a 2023–2024 season, and so be able to identify salary calibrated to that season.
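The multiplier calculation can be written as a small function. The two league averages are the ones quoted in the example; the `era_multiplier` name, the dictionary layout and the 20.0 PPG player are assumptions for illustration.

```python
# League-average PPG figures quoted in the article.
LEAGUE_AVG_PPG = {"2019-20": 111.8, "2023-24": 115.2}


def era_multiplier(season, target="2023-24", averages=LEAGUE_AVG_PPG):
    """Factor that rescales a stat from `season` to the `target` season."""
    return averages[target] / averages[season]


multiplier = era_multiplier("2019-20")

# A hypothetical player who averaged 20.0 PPG in 2019-20.
adjusted_ppg = 20.0 * multiplier

print(f"multiplier: {multiplier:.2f}")
print(f"adjusted PPG: {adjusted_ppg:.1f}")
```

The same function would be reused per stat with that stat's own table of yearly averages, so each feature gets its own multiplier for each season.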
Data Normalisation
It was important to ensure that all stats used proportional scaling. For example, FGP is from 0–100, while STL has no limit but generally ranges from 0–3. Without scaling, it’s possible that the machine learning algorithm might place a greater emphasis on FGP than STL, only because FGP uses a larger scale.
To mitigate this risk, min-max scaling was used to normalise all values to between 0–1.
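Min-max scaling maps each value to (value - min) / (max - min). A minimal sketch is below; the `min_max_scale` helper and the sample FGP and STL values are made up for the example.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample values on very different scales.
fgp = [40.0, 45.0, 50.0, 60.0]   # field goal percentage
stl = [0.5, 1.0, 1.5, 2.5]       # steals per game

print(min_max_scale(fgp))
print(min_max_scale(stl))
```

After scaling, both columns cover the same 0 to 1 range, so neither dominates the fit purely because of its units.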
Linear Regression Modelling
Once data pre-processing was complete, this information could be used to train the salary estimator. A linear regression model was trained on data from the previous 4 seasons. The coefficient for each feature was chosen to minimise the squared differences between the actual salary and the estimated value. Each feature was therefore given a weighting, with a higher weighting indicating a relatively strong relationship between that feature and salary.
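The least-squares fit described above can be sketched with NumPy. This is not the project's actual training code: the two feature columns, the synthetic salaries (in $ millions) and the "true" weights used to generate them are all made up, so that the recovered coefficients can be checked.

```python
import numpy as np

# Made-up, already min-max scaled features for five hypothetical players.
# Columns: two illustrative stats (say PPG and AST).
X = np.array([
    [0.1, 0.2],
    [0.4, 0.1],
    [0.5, 0.6],
    [0.8, 0.3],
    [0.9, 0.9],
])

# Synthetic salaries in $ millions, generated from known weights.
true_intercept = 5.0
true_weights = np.array([20.0, 10.0])
y = true_intercept + X @ true_weights

# Add a column of ones so the model can learn an intercept.
A = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise the sum of squared residuals.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, weights = coef[0], coef[1:]

print("intercept:", round(float(intercept), 2))
print("weights:", np.round(weights, 2))
```

Because the synthetic salaries contain no noise, the fit recovers the generating weights; real salary data would leave residuals, and those residuals are what the pay gap measures later on.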
Feature Elimination
From the initial weightings, it was clear that features such as PPG and AST were positively correlated with salary. However, features such as FGP were in fact negatively correlated with salary. Although initially we would expect a higher FGP to result in more salary, there is a valid reason for this counterintuitive trend. Role players tend to only take shots they’re comfortable with, resulting in a relatively high FGP. Conversely, a ‘star’ who takes many shots in a game, some of which are heavily contested, may not have as high an FGP, even though their contribution to the game is more significant (and therefore garners a higher salary). To mitigate the risk of the model ‘punishing’ players with a good FGP, this feature was removed from the model.
During this process, a new feature RSO (Relative Scoring Output) was also built, which was an aggregation of both PPG and 3P%. Ultimately, this feature was also removed due to the lack of correlation between 3P% and salary. Through multiple iterations, features were considered and removed until only 3 remained.
The lambda value was also experimented with (for ridge regression); however, this did not significantly affect the coefficients, and so was not considered in the model.
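For readers unfamiliar with the lambda parameter, ridge regression adds a penalty of lambda times the squared size of the weights to the least-squares objective. The sketch below uses the closed-form solution on made-up data; the `ridge_fit` helper, the data and the choice to leave the intercept unpenalised are assumptions, not the project's implementation.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centring removes the intercept from the penalty
    yc = y - y_mean
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return b, w


# Made-up scaled features and synthetic salaries in $ millions.
X = np.array([
    [0.1, 0.2],
    [0.4, 0.1],
    [0.5, 0.6],
    [0.8, 0.3],
    [0.9, 0.9],
])
y = 5.0 + X @ np.array([20.0, 10.0])

for lam in (0.0, 0.1, 1.0):
    b, w = ridge_fit(X, y, lam)
    print(f"lambda={lam}: intercept={b:.2f}, weights={np.round(w, 2)}")
```

With lambda set to 0 this reduces to ordinary least squares; larger values shrink the weights towards zero. If the weights barely move across a sensible range of lambda, as reported above, leaving the penalty out is a reasonable simplification.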
Modelling an Exponential Relationship
Up to this point, the linear regression algorithm assumed that the correlation between the quality of a player and their salary was linear. For example, average players receive 50% of the max salary, and some of the best players might receive 80 or even 90% of the max salary. However, player salary is not evenly distributed based on relative performance. Oftentimes, the best of the best earn much more money than their counterparts.
From the scatter plot above, there is a clear exponential relationship between player quality and salary. To represent this relationship in the linear regression model, the min-max values (0 to 1) were transformed to a new scale (-4 to 2), and then exponentiated.
As a result of this transformation, estimated salary will increase exponentially from a ‘worse’ to a ‘better’ player, mimicking the real-life trend.
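The rescale-then-exponentiate step can be sketched for a single scaled value. The -4 to 2 range comes from the text; the `exp_transform` name is an assumption, and the sketch does not try to reproduce exactly which columns the transformation was applied to.

```python
import math


def exp_transform(scaled, low=-4.0, high=2.0):
    """Map a 0-1 value linearly onto [low, high], then exponentiate it."""
    return math.exp(low + (high - low) * scaled)


for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"scaled={s:.2f} -> transformed={exp_transform(s):.4f}")
```

Equal steps on the 0 to 1 scale now multiply the transformed value by a constant factor instead of adding a constant amount, which is what lets a linear model produce much steeper estimates at the top end.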
Estimating Player Salary
Once the final weightings were determined, these values could be applied to the stats of any player in the 2023–24 season to estimate their salary.
Example of model estimating player salary for different players
Calculating Pay Disparity
1. Literal pay gap = Actual salary - predicted salary. A positive value would mean the player is overpaid, and a negative value would mean they’re underpaid.
2. Proportional pay gap = Actual salary / predicted salary. A value > 1 would indicate the player is overpaid, and a value < 1 would mean they’re underpaid.
It was important to include both measures of pay disparity when considering who is most over/underpaid. For example, a player who earns $500k but should be on $3 million is underpaid by $2.5 million (a literal pay gap of -$2.5 million) and has a proportional pay gap of around 0.17x their predicted salary. By comparison, a player who earns $20 million but should be on $25 million is underpaid by a larger amount ($5 million), but the proportional pay gap is much closer to 1 (0.8x their predicted salary).
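Both measures are one-line calculations. The sketch below applies them to the two example players from this section; the function names are assumptions made for the example.

```python
def literal_gap(actual, predicted):
    """Actual minus predicted salary: negative means underpaid."""
    return actual - predicted


def proportional_gap(actual, predicted):
    """Actual divided by predicted salary: below 1 means underpaid."""
    return actual / predicted


# The two example players described in the text: (actual, predicted).
examples = {
    "low earner": (500_000, 3_000_000),
    "high earner": (20_000_000, 25_000_000),
}

for label, (actual, predicted) in examples.items():
    lit = literal_gap(actual, predicted)
    prop = proportional_gap(actual, predicted)
    print(f"{label}: literal gap = {lit:,} | proportional gap = {prop:.2f}x")
```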
Evidence of most underpaid players by proportional gap displayed.
From the results above, we can see that calculating most underpaid players by proportional gap tends to favour players who were paid relatively low salaries in the first place (as there is a lot of potential for relative growth). To mitigate this risk, the search was restricted only to players earning at least $1 million.
After this adjustment, the most overpaid and underpaid players of the 2023–24 season could be determined.
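The salary floor and the ranking can be sketched as a filter followed by a sort. The four players and their salaries are placeholders, and the `MIN_SALARY` constant simply encodes the $1 million threshold mentioned above.

```python
# Placeholder players; actual and predicted salaries are illustrative only.
players = [
    {"name": "Player A", "actual": 500_000, "predicted": 3_000_000},
    {"name": "Player B", "actual": 20_000_000, "predicted": 25_000_000},
    {"name": "Player C", "actual": 2_000_000, "predicted": 8_000_000},
    {"name": "Player D", "actual": 30_000_000, "predicted": 15_000_000},
]

MIN_SALARY = 1_000_000  # the $1 million floor described in the text

# Drop players below the floor, then sort by proportional gap (lowest first).
eligible = [p for p in players if p["actual"] >= MIN_SALARY]
most_underpaid = sorted(eligible, key=lambda p: p["actual"] / p["predicted"])

for p in most_underpaid:
    print(p["name"], round(p["actual"] / p["predicted"], 2))
```

Reversing the sort order gives the most overpaid players, and swapping the key for actual minus predicted ranks by literal gap instead.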
Findings
By comparing the top results for both the literal and proportional pay gap, we could suggest who were the most ‘overpaid’ and ‘underpaid’ players of the 2023–24 season.
Most overpaid players (According to AI)
PG — Ben Simmons
SG — Klay Thompson
SF — Gordon Hayward
PF — Reggie Bullock Jr
C — Davis Bertans
sixth man — Joe Harris
Most underpaid players (According to AI)
PG — Tyrese Haliburton
SG — Tyrese Maxey
SF — Anthony Edwards
PF — GG Jackson
C — Alperen Şengün
sixth man — Jalen Williams
You can read a more in-depth report on why these players are over/underpaid on my stories page.
Future Improvements
There are some improvements that could be made to future versions of this model.
One improvement would be allowing the user to specify which stats they deem important, including advanced stats like player impact estimate. The user could also manually set the weightings for each stat based on the needs of a team, such as valuing 3P% for 3-point shooters.
Another feature would be allowing the user to set a budget, and then showing underpaid players in that budget. They could also filter results by other factors like position, age, years left on contract etc. This could be useful for both NBA teams trying to stay within their salary cap, and fans looking to draft players to their fantasy team.
Final Thoughts
Players like PJ Tucker have shown that stats alone don’t always determine the impact a player has on the game. However, this machine learning model can serve as a quick and efficient way to identify pay disparities and uncover ‘hidden gems’ in the league.
Fair Use Statement
The utilisation of player statistics from NBA.com, player salaries from HoopsHype.com, and yearly averages from Basketball-Reference.com in the creation of the model constitutes fair use under copyright law. The use of this data is transformative in nature, involving the extraction, normalisation, and analysis of the data to develop an original machine learning model for estimating NBA player salaries.
The model relies on the aggregation and processing of publicly available data to generate insights and predictions regarding player salaries. It has not merely replicated or redistributed the original data but has transformed it into a novel form for the purpose of statistical analysis and model development.
The use of the data does not negatively impact the market value of the original works, nor does it inhibit the ability of the copyright holders to derive profit from their data. Instead, it contributes to the advancement of knowledge and innovation in the field of sports analytics by providing valuable insights into player valuation and salary trends within the NBA.
Additionally, the article serves an educational and informational purpose, providing readers with insights into the methodology and findings of the research. By sharing the process and results, the aim is to foster understanding and discussion around the intersection of data science and professional sports.
In accordance with fair use principles, the original sources of the data used in the analysis have been attributed, and the use has not exceeded the scope of transformative purpose. It is believed that the use of this data falls within the bounds of fair use and complies with applicable copyright laws.
If there are any questions about this article, please feel free to get in touch at lukejamesmccabe@gmail.com