Because the department and salary_range columns have an excessively high number of missing values, we decided to drop them. This decision carries the potential downside of losing potentially useful information about the prevalence of fraudulent activity in certain departments or salary ranges. However, removing these columns makes room for more reliable information from other variables without cutting away too much of the data.
Interestingly, the number of legitimate job postings outweighs the fraudulent ones by quite a margin.
Quite a disparity, right? There’s no way we can draw unbiased conclusions from a dataset with such a high class imbalance. Therefore, our next step was to randomly cut down on the legitimate job postings until we had a more balanced dataset.
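To give a sense of what this looks like in code, here is a minimal sketch of the undersampling step, assuming the data sits in a pandas DataFrame loaded from the raw CSV and that the label column is named `fraudulent` (1 = fake, 0 = legitimate); the file name and random seed are illustrative.

```python
import pandas as pd

# Load the raw data (file name is illustrative).
df = pd.read_csv("fake_job_postings.csv")

# Drop the columns with excessive missing values, as discussed above.
df = df.drop(columns=["department", "salary_range"])

fraud = df[df["fraudulent"] == 1]
legit = df[df["fraudulent"] == 0]

# Randomly sample as many legitimate postings as there are fraudulent ones.
legit_downsampled = legit.sample(n=len(fraud), random_state=42)

# Combine and shuffle to obtain a balanced dataset.
df_balanced = (
    pd.concat([fraud, legit_downsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
```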
We also created a function to help us prepare the raw text data for machine learning (ML). It takes in raw text and applies several cleaning steps to make the text easier for a machine learning model to understand. Here’s how the function cleans the data (a rough code sketch follows the list):
- Removes numbers and converts all text to lowercase.
- Removes common abbreviations, symbols, web addresses, and email addresses.
- Replaces newline characters with spaces and removes text inside square brackets.
- Removes punctuation from the text.
- Normalizes word contractions, such as changing “I’ve” to “I have”.
- Removes HTML tags from the text.
- Tokenizes the text (splits it into individual words), removes common words (stop words), and applies stemming (reduces words to their root form).
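Here is a rough sketch of what such a cleaning function could look like, assuming NLTK for stop words and stemming; the exact regexes and the contraction list in our notebook may differ, and the contraction dictionary below is only an illustrative subset.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
# Illustrative subset of contraction mappings.
CONTRACTIONS = {"i've": "i have", "can't": "cannot", "won't": "will not", "it's": "it is"}


def clean_text(text: str) -> str:
    text = text.lower()                                    # lowercase
    text = re.sub(r"<.*?>", " ", text)                     # strip HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)          # strip web addresses
    text = re.sub(r"\S+@\S+", " ", text)                   # strip email addresses
    text = text.replace("\n", " ")                         # newlines -> spaces
    text = re.sub(r"\[.*?\]", " ", text)                   # drop text inside square brackets
    for contraction, expanded in CONTRACTIONS.items():     # normalize contractions
        text = text.replace(contraction, expanded)
    text = re.sub(r"\d+", " ", text)                       # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()                                  # tokenize into individual words
    tokens = [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]  # stop words + stemming
    return " ".join(tokens)
```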
Now, with the data neatened up, we can finally move on to drawing conclusions from it.
EDA is a crucial step in the data science process that allows us to understand the data’s structure, identify any anomalies or outliers, and spot patterns or trends. It’s during this stage that we clean our data, dealing with missing or incorrect entries that could skew our predictions.
To begin with, it’s important to know which variables are positively or negatively correlated with each other. To achieve this, we visualized the correlation between the numerical columns in the DataFrame using a heatmap.
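As a sketch, the heatmap can be produced with seaborn along these lines; the figure size and color palette are assumptions, chosen here to match the purple tones mentioned below.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numerical columns only.
numeric_corr = df_balanced.select_dtypes(include="number").corr()

plt.figure(figsize=(8, 6))
sns.heatmap(numeric_corr, annot=True, fmt=".2f", cmap="Purples")
plt.title("Correlation between numerical columns")
plt.tight_layout()
plt.show()
```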
Interestingly, according to the heatmap, most correlations appear weak, as indicated by the dominance of the lighter purple color, suggesting that there is no strong correlation between these pairs of variables. However, the features has_company_logo and has_questions have a noticeable positive correlation of 0.23. This suggests that job postings with a company logo are more likely to include questions in the application process.
Moreover, another interesting finding is that Australia has a significantly higher proportion of fraudulent job postings compared to the other countries.
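For reference, the per-country fraud rate can be computed roughly as follows; in the raw dataset the country code is embedded in the `location` column, so the split below is an assumption about the preprocessing.

```python
# Extract the country code from the location string (e.g. "US, NY, New York").
df_balanced["country"] = df_balanced["location"].str.split(",").str[0]

# Share of fraudulent postings per country, highest first.
fraud_rate_by_country = (
    df_balanced.groupby("country")["fraudulent"].mean().sort_values(ascending=False)
)
print(fraud_rate_by_country.head(10))
```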
Let’s now take a look at a few visualizations of fraudulent job postings from the perspectives of different features:
In a nutshell, fraudulent job postings are most commonly associated with full-time, entry-level roles, mostly requiring a high school or equivalent education level, often in the fields of administration and engineering. Fake postings also tend to lack a company logo and are especially common in the oil & energy industry.
We also managed to extract the top 10 keywords that fake job postings apparently love to use:
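One simple way to get such a list is to count term frequencies in the cleaned text of fraudulent postings; this sketch assumes the posting text lives in a `description` column and reuses the `clean_text` function from earlier.

```python
from collections import Counter

# Concatenate the cleaned descriptions of fraudulent postings and count terms.
fraud_text = df_balanced.loc[df_balanced["fraudulent"] == 1, "description"].fillna("")
word_counts = Counter(" ".join(fraud_text.map(clean_text)).split())
print(word_counts.most_common(10))
```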
Now that all our research questions have been answered, there is one final finding that we believe is worth sharing. According to the box-and-whisker plot below, fraudulent job postings tend to have a lower character count than legitimate job postings. One would have thought that fake recruiters would put in extra effort to make their postings as believable as possible. But apparently, they tend to be more conservative with the amount of time they spend at their keyboards.
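A quick sketch of that comparison, again assuming the posting text is in a `description` column:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Character count of each posting's text.
df_balanced["char_count"] = df_balanced["description"].fillna("").str.len()

sns.boxplot(data=df_balanced, x="fraudulent", y="char_count")
plt.xticks([0, 1], ["Legitimate", "Fraudulent"])
plt.ylabel("Character count")
plt.show()
```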
With the rise of online job portals, it has become increasingly easy for malicious actors to post fake job listings with the intent of scamming poor, unsuspecting job seekers. Therefore, using existing data, it is imperative that we develop robust classifiers that can root out fake job postings.
Using our limited dataset, we trained a total of five models for this classification task.
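Before going model by model, here is a minimal sketch of the setup shared by the text-based models: TF-IDF features over the cleaned descriptions and a stratified train/test split. The vectorizer settings and the 80/20 split are assumptions, not necessarily what the notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Cleaned text and labels.
texts = df_balanced["description"].fillna("").map(clean_text)
labels = df_balanced["fraudulent"]

# Turn the text into TF-IDF features.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)

# Stratified 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)
```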
1. Logistic Regression
A simple yet powerful algorithm for binary classification, logistic regression is often the first choice for any classification task. Naturally, it was our first pick as well. Here are the model results:
It’s important to keep in mind that we are working with a balanced dataset, which came at the cost of dropping tens of thousands of entries. Therefore, a lower accuracy value is to be expected with logistic regression. Using the original (imbalanced) dataset, we were able to obtain a whopping 96% accuracy.
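A hedged sketch of the logistic regression baseline on those TF-IDF features (hyperparameters are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Precision, recall, and F1 per class.
print(classification_report(y_test, log_reg.predict(X_test)))
```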
2. Random Forest Classifier
Random Forest is an ensemble method that provides a more robust prediction by combining the predictions of multiple decision trees. Known for its high accuracy and its ability to handle imbalanced datasets, it proved to be the perfect fit (no pun intended) for us.
First, we used the entropy criterion:
Then we trained the Random Forest classifier using the Gini impurity criterion:
The Random Forest classifier with the Gini impurity criterion gave us a truly impressive 90% accuracy.
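As a sketch, both variants can be trained by switching the `criterion` argument; the number of trees is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

for criterion in ("entropy", "gini"):
    rf = RandomForestClassifier(n_estimators=100, criterion=criterion, random_state=42)
    rf.fit(X_train, y_train)
    acc = accuracy_score(y_test, rf.predict(X_test))
    print(f"Random Forest ({criterion}): accuracy = {acc:.2f}")
```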
3. Support Vector Machine Classifier
SVM is yet another powerful and versatile classifier that is commonly employed when dealing with high-dimensional data. We believe it is particularly well suited to this task for that reason. Here are the results:
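A minimal SVM sketch on the same features; the linear kernel here is an assumption, not necessarily the kernel used in the notebook.

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

svm = SVC(kernel="linear")
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))
```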
4. Multinomial Naive Bayes Classifier
Multinomial NB is a probabilistic learning method that is widely used in Natural Language Processing (NLP). The algorithm is based on Bayes’ theorem and uses the frequency of words associated with each tag in the training data to predict the tag of a new text, making it a classifier worth trying for a text classification problem like ours.
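A sketch of the Multinomial NB classifier on the same TF-IDF features (default smoothing assumed):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

mnb = MultinomialNB()
mnb.fit(X_train, y_train)
print(classification_report(y_test, mnb.predict(X_test)))
```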
5. K-Nearest Neighbors Classifier
For non-text data, K-Nearest Neighbors (KNN) was used. KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure, making it a good baseline for classification tasks. For us, this is the only model we trained on non-text data.
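A hedged sketch of KNN on the numeric, non-text columns; which columns were used and the value of k are assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Binary/numeric features only (column choice is an assumption).
numeric_features = ["telecommuting", "has_company_logo", "has_questions"]
X_num = df_balanced[numeric_features]
y_num = df_balanced["fraudulent"]

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_num, y_num, test_size=0.2, random_state=42, stratify=y_num
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xn_train, yn_train)
print(classification_report(yn_test, knn.predict(Xn_test)))
```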
Summary
The Random Forest (Gini) model has the highest accuracy, and it also shows a good balance between precision, recall, and F1 score, meaning it is the most reliable model among the ones we tested for the task of classifying fake job postings.
You can find the .ipynb notebook with the relevant code and visualizations here.