- Reducing the Dimensionality of Job Titles:
The ‘job_title’ feature in the dataset has 152 distinct values, most of which account for less than 1% of the total data points. It consists of short, usually two-word strings. Looking through the values, we noticed that many are similar positions with different wording or slight variations in text. If we fed these 152 titles directly into the model, the dimensionality of our training data would increase significantly. The lack of clear categories would also make our results less explainable. Therefore, we decided to group the most common titles. NLP and clustering algorithms were used to discover preliminary trends. We noticed common keywords across instances and used them as the basis for grouping. The ‘job_title’ column was split into two new categorical features: job_category (Data, Machine Learning, Business Intelligence, Artificial Intelligence, Research, Other) and job_type (Engineer, Analyst, Developer, Scientist, Specialist, Manager, Other). When processing a job title, any abbreviation in it, e.g. ‘BI’ or ‘ML’, was replaced with its full form. Then we checked whether the title contains categories from ‘job_category’ and ‘job_type’. If there is no overlapping string, ‘Other’ was assigned. For example, ‘Machine Learning Infrastructure Engineer’ becomes ‘Machine Learning’ and ‘Engineer’. In this way, 152 potential features were reduced to 6 + 7 = 13.
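To make the grouping concrete, here is a minimal sketch of the keyword-matching step described above. The abbreviation map and keyword lists are trimmed for illustration, and the names are ours to pick; the exact lists in our code may differ.

```python
# Minimal sketch of the job-title grouping; lists are illustrative.
ABBREVIATIONS = {"BI": "Business Intelligence", "ML": "Machine Learning",
                 "AI": "Artificial Intelligence"}
CATEGORIES = ["Machine Learning", "Business Intelligence",
              "Artificial Intelligence", "Research", "Data"]
TYPES = ["Engineer", "Analyst", "Developer", "Scientist", "Specialist", "Manager"]

def split_title(title: str) -> tuple[str, str]:
    # Expand abbreviations such as 'BI' or 'ML' first.
    expanded = " ".join(ABBREVIATIONS.get(word, word) for word in title.split())
    # Assign the first matching keyword, falling back to 'Other'.
    category = next((c for c in CATEGORIES if c in expanded), "Other")
    job_type = next((t for t in TYPES if t in expanded), "Other")
    return category, job_type

print(split_title("ML Infrastructure Engineer"))  # ('Machine Learning', 'Engineer')
```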
- Clustering Location Features:
The ‘employee_residence’ feature has 87 distinct values, and ‘company_location’ has 76. They consist of country codes. Similar to job_title, the number of categories makes them difficult to encode, train on, and interpret. We decided to keep the countries accounting for more than 50% of the data. The remaining data points are clustered by continent code, as geographic proximity is likely to create similar job markets. For example, ‘SE’ would be replaced by ‘EU’.
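As a sketch, assuming a hand-written country-to-continent lookup (only a few codes shown here), the grouping could look like this in pandas:

```python
import pandas as pd

# Hypothetical excerpt of a country-to-continent map; the real map
# covers every country code that appears in the dataset.
CONTINENT = {"SE": "EU", "DE": "EU", "FR": "EU", "IN": "AS", "JP": "AS", "BR": "SA"}

def cluster_locations(codes: pd.Series, threshold: float = 0.5) -> pd.Series:
    # Keep any country that alone accounts for more than `threshold` of the
    # rows (e.g. 'US'); map every other country to its continent code.
    freq = codes.value_counts(normalize=True)
    keep = set(freq[freq > threshold].index)
    return codes.where(codes.isin(keep), codes.map(CONTINENT).fillna("Other"))
```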
Our last big data preprocessing task was converting the categorical variables in our dataset into numerical values for use in our models. Here is a breakdown of the techniques we employed:
- Nominal Categorical Variables:
We identified seven categorical variables, ‘employee_residence’, ‘company_location’, ‘remote_ratio’, ‘job_category’, ‘job_type’, ‘employment_type’, and ‘company_size’, as nominal. Nominal variables represent categories with no inherent order or ranking. To prepare these variables for the model, we used one-hot encoding, which creates a new binary variable for each unique category within the original variable. For example, ‘remote_ratio’ has the categories “0”, “50”, and “100”, which represent distinct work arrangements: fully in-office, partially remote, and fully remote. (“0”, “50”, and “100” might appear ordinal, but because the ratio should not influence salary in an ordered way, we treated ‘remote_ratio’ as nominal.) Using one-hot encoding on ‘remote_ratio’ generates three new binary variables, one for each work arrangement, allowing the model to learn the relationships between these distinct arrangements and salary without assuming any inherent order between them.
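A minimal sketch of this step with pandas (a toy frame with only ‘remote_ratio’ shown; in our pipeline all seven nominal columns are passed in):

```python
import pandas as pd

df = pd.DataFrame({"remote_ratio": [0, 50, 100, 100]})

# One binary column per category, with no order implied between them.
encoded = pd.get_dummies(df, columns=["remote_ratio"], prefix="remote")
print(encoded.columns.tolist())  # ['remote_0', 'remote_50', 'remote_100']
```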
- Ordinal Categorical Variable:
We identified ‘experience_level’ as an ordinal categorical variable. Ordinal variables represent categories with a natural order or ranking. To preserve this order in our model, we applied ordinal encoding. This method assigns a numerical value to each category, reflecting its position within the order. In our case, ‘experience_level’ is encoded as “EN” (Entry-Level) = 1, “MI” (Mid-Level) = 2, “SE” (Senior) = 3, and “EX” (Executive) = 4, which allows the model to capture the increasing influence of experience on salary.
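Because the ranking is known ahead of time, the encoding reduces to a fixed mapping; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"experience_level": ["EN", "SE", "MI", "EX"]})

# Preserve the Entry < Mid < Senior < Executive ranking.
EXPERIENCE_ORDER = {"EN": 1, "MI": 2, "SE": 3, "EX": 4}
df["experience_level"] = df["experience_level"].map(EXPERIENCE_ORDER)
print(df["experience_level"].tolist())  # [1, 3, 2, 4]
```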
Linear Regression
A simple and computationally efficient model that assumes a linear relationship between the independent variables and the target variable (salary_in_usd). We chose to start our analysis with linear regression because it is an easily interpretable model that we hoped would give us a good initial understanding of how each feature influences salary before moving on to other models.
Support Vector Regression (SVR)
Support vector regression makes predictions using support vectors and a decision boundary in a hyperplane. It is a flexible model capable of detecting both linear and non-linear relationships in data. We chose SVR as a follow-up to linear regression. If the linear model performs well, SVR can provide further evidence of patterns in the data. Otherwise, we can switch to other kernels (e.g. RBF) for further analysis.
Decision Tree Regressor (DTR)
A decision tree works by recursively partitioning data based on variables (splitting criteria) at each internal node. The splitting criteria aim to minimize the prediction error (for regression) at each split. The final prediction for a new data point is obtained by traversing the tree from the root node to a leaf node based on the sequence of decisions made at each internal node. We chose this model because we wanted to try several types of regression analysis to see what our best fit would be.
Random Forest Regression (RFR)
We chose this model as the logical next step after a decision tree. Random forest is an ensemble method that combines multiple decision trees into one model. Each tree makes a prediction, and the final prediction is the average of the individual tree predictions. This approach helps reduce variance and improve accuracy compared to a single decision tree.
Gradient Boosting Regressor (GBR)
An ensemble method that builds multiple decision trees sequentially. Each subsequent tree is trained to correct the errors of the previous trees, leading to a more robust model. Gradient boosting can handle complex relationships and is very powerful for regression tasks, which is why we chose to include it in our regression analysis.
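The five models, plus the dummy baseline used for comparison later, can be set up in scikit-learn roughly as follows. This is a sketch: X and y stand for the encoded feature matrix and salary target from preprocessing, and the hyperparameters shown are defaults rather than our tuned values.

```python
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

models = {
    "Dummy (baseline)": DummyRegressor(strategy="mean"),
    "Linear Regression": LinearRegression(),
    "SVR": SVR(kernel="linear"),
    "DTR": DecisionTreeRegressor(random_state=0),
    "RFR": RandomForestRegressor(random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
}

# X, y: encoded feature matrix and salary target (assumed from preprocessing).
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```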
We decided to examine the top-performing models from the previous stage: Linear Regression, SVR, and GBR.
Precision & Variance
We trained the models on training sets of increasing size to test for the precision-variance tradeoff. Accuracy did not change significantly, which suggests that the performance of our models is stable and generalizable.
For all three models, training and testing accuracy converge as the size of the training data increases. There is no recognizable divergence. Therefore, we can conclude that overfitting is not a problem.
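The convergence check comes from learning curves; a sketch with scikit-learn (GBR shown, with X and y as before):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

# Training and validation R^2 at increasing training-set sizes; converging
# curves with no late divergence indicate that overfitting is not a problem.
sizes, train_scores, val_scores = learning_curve(
    GradientBoostingRegressor(random_state=0),
    X, y, cv=5, scoring="r2",
    train_sizes=np.linspace(0.1, 1.0, 5),
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train R^2 = {tr:.2f}, validation R^2 = {va:.2f}")
```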
Fine-tuning
Before proceeding, we tested SVR with a linear kernel and got an R-squared of 0.337. The performance is similar to linear regression, which suggests a possible linear relationship.
We fine-tuned the linear regression and GBR using cross-validation and grid search.
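For GBR, the search looks roughly like the sketch below; the parameter grid here is illustrative, not the exact grid we ran.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values we actually searched may differ.
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.01, 0.1],
    "max_depth": [2, 3, 4],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X, y)  # X, y: encoded features and salary target
print(search.best_params_, search.best_score_)
```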
Our path to finding the optimal model was a tedious one. We tested five different models and performed parameter adjustments throughout. We created a dummy regressor as the baseline for comparison. The table below summarizes the performance of our models. All models perform well. Mean squared errors are around 0.6, which is small relative to the scale of our targets, and R-squared values are around 0.3. Compared to the dummy regressor, the five models we tested significantly improved accuracy: MSE decreased by around 40%, and R-squared increased by between 0.32 and 0.36.
The most effective model in our testing was the GBR, with the highest R-squared and the lowest root mean squared error (RMSE) and mean squared error (MSE). Other top-performing models include Linear Regression and SVR. RFR performs better than DTR, reflecting its ability to capture complex categorical relationships by building multiple decision trees and effectively aggregating them. The results are extremely close, and further exploration is needed.
Our goal with this project was to create a machine learning model that predicts data science salaries based on the characteristics of the job. Throughout this process, data preparation made the largest contribution to our results. These cleaning processes were effective in improving the performance of the models. Our experimentation with different learning algorithms helped show that a complex dataset requires careful preprocessing and experimentation with models to capture the complex relationships within it. Linear regression and ensemble methods like random forest and gradient boosting all gave reliable results without overfitting. From the feature importance graph, we learned that they uncover different relationships among the feature variables.
The best regressor, gradient boosting, performed well and explained 36% of the variance in the dataset. Tuning the hyperparameters of the models to see which performs best and fastest is the next step in optimizing them. The models from this project could also help people in the field. A dashboard or portal where someone could make informed decisions about job offers could go a long way toward informing people how much they are worth. Additionally, the knowledge and techniques apply to other datasets and could be used to predict salaries in fields beyond data science. This project serves as a basis for further exploration of salary transparency and the role data science can play in it.
If we had more time, our group would have liked to build more explainability into our model so that the people who use it can understand the features that influence salary predictions and how our model weighs them. Transparency in how the model makes predictions is essential for building trust with users and accountability for our results. Additionally, transparency may help individuals in the job market make more informed decisions about which job postings they might or might not consider.
There is also a great deal of inherent bias in data concerning salary (race, gender, location), and we are aware of the ethical concern that our model could perpetuate these biases in potentially damaging ways. Our models are decently accurate, but we want to make it clear that they are accurate with respect to training data that is probably biased. Individuals using our model to make job decisions based on salary predictions should be aware of this and factor it into their decision-making and their understanding of our model's predictions. We want to make sure that our model's predictions do not contribute to unfair labor practices (e.g. unfair pay) or exacerbate inequalities in the workforce.
When creating machine learning models for people to use as tools in their daily lives, it is vital to make clear that a model is only a tool. If people use our model to decide which jobs to take and apply for, they should also use their critical thinking skills and not rely solely on our model.
Our project allowed us to learn about machine learning and its real-world capabilities. By building an end-to-end project, we demystified some of the machine learning tools and algorithms we encounter every day by looking under the hood and creating a model of our own.
Our team learned how to work as a group and how to run a collaborative machine-learning project. We learned to use tools like git and VS Code Live Share to work on our code together synchronously, and to use GitHub Issues and Pull Requests to delegate tasks.
In the data preprocessing stage of our project, we gained valuable insight into the importance of preparing data for our model. Our dataset had a number of categorical variables that we had to preprocess using different encoding techniques (one-hot encoding for nominal variables and ordinal encoding for variables with an inherent order), which highlighted the crucial role of transforming categorical variables into a format compatible with our regression models.
At first, our data preprocessing did not go as expected. We were fairly sure that we had encoded our data correctly, but our models were severely under-performing: our R-squared scores were extremely low. Confused, we started to change the way we were encoding our variables, and we even tried different encoding methods like label encoding. Our changes didn't work. We realized this was because our initial intuition had been correct: we had encoded our data properly, but our data had outliers that needed addressing before our models could improve. Once we removed the outliers from our data, we saw much better results. Our best R-squared ended up at 0.36, signifying our model's ability to predict data science salaries. This taught us a lot about the need for careful data cleaning and feature engineering to ensure the quality and usefulness of our data for accurate results (in our case, salary prediction).
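For reference, one common way to do the outlier removal described above is the IQR rule on the salary target; this sketch assumes that rule, which may differ in detail from what is in our repository.

```python
import pandas as pd

def drop_salary_outliers(df: pd.DataFrame, col: str = "salary_in_usd") -> pd.DataFrame:
    # IQR rule: drop rows more than 1.5 * IQR outside the quartiles.
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    within = df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[within]
```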
Evaluating our machine learning models with metrics like R-squared and MSE was also a learning experience for our group. It was fascinating to see each metric's ability to capture salary trends. While R-squared gave us a sense of the proportion of variance in salaries explained by each model, MSE offered a more specific view of how well the models predicted individual salaries. The relationship between the metrics helped us identify models that both explained overall trends and minimized the average squared difference between predicted and actual salaries. We found that attending to both the broader fit (R-squared) and the individual error (MSE) allowed us to make a more informed decision about which models were most effective for predicting data science salaries in our dataset.
In the end, finding that gradient-boosted regression performed best on our dataset taught us that for data with complex relationships, we can benefit from using a complex model. In our case, the simpler models proved less effective than the complex ones, demonstrating that sometimes a simple model is just that: too simple.
All in all, this semester's Machine Learning project capped off our learning with a fun real-world project that our group gained a lot from. Much of our learning came from trial and error, and while we all recognize that trial and error is an important part of the learning process, if we could do it all again we would spend more time on our initial data visualization to get a better understanding of our dataset and reduce the need for so much trial and error. We spent a lot of time fixing errors and tracking down issues in our code that would have been quicker to understand with a better grasp of our dataset.
View our code right here: https://github.com/bdeanhardt/DIS_ML_Final_Project