In my earlier story, Part 1 of this topic (https://medium.com/towards-artificial-intelligence/how-to-use-target-encoding-in-machine-learning-credit-risk-models-part-1-cf72f30f51a8), we covered the derivation of the expression for WoE using maximum likelihood. Now we will apply it practically on a random dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from scipy.special import logit, expit
We have imported two special functions, logit and expit. We will use them in our script because logit is the inverse logistic function and expit is the logistic function.
print(expit(0))
print(logit(0.5))
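As a quick sanity check that the two are inverses of each other, here is a minimal sketch using the imports above:
p = expit(2.0)     # logistic function: maps the log-odds 2.0 to a probability (~0.88)
print(logit(p))    # logit maps the probability back to log-odds: prints ~2.0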
The next step is to generate the y-vector randomly. For this we use the Bernoulli distribution (actually we use the Binomial distribution with n = 1). Each Bernoulli trial produces either a 1 or a 0 based on the probability parameter, and the n samples we generate correspond to the n trials of the Bernoulli distribution. Note that the Binomial distribution has two parameters, n and p. Therefore, randomly sampling from the Binomial requires us to specify both n and p beforehand, plus the "size", which is the number of experiments. Each experiment consists of n trials and produces the "number of successes", which is the random variable of the Binomial distribution.
y = np.random.binomial(1, 0.01, 10000)
print(sum(y))
np.random.binomial(10000, 0.01, 10)
It is evident that the array generated using the Bernoulli approach is what we want: the vector y of 1s and 0s. In contrast, the array generated using the Binomial approach is sum(y), i.e. the number of events in each Binomial experiment (each experiment has n = 10000 trials), repeated 10 times. We do not want that for our analysis.
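A quick check (a sketch, not part of the original script) confirms the Bernoulli vector behaves as intended: its mean is close to p, and its sum is itself one realisation of a Binomial(10000, 0.01) count.
print(y.mean())   # close to 0.01, the event probability we specified
print(y.sum())    # a single "number of successes", comparable to the Binomial draws above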
Then we generate two distributions for the variable "x", for which we need to compute the "weight of evidence", or which we want to target encode.
x_1 = np.random.normal(130, 15, 114)
x_0 = np.random.normal(100, 20, 9886)
plt.hist(x_1)
plt.hist(x_0)
Note that we have deliberately sampled x_1 and x_0 such that the proportion of events (1s) is higher in the upper range of x values. This will give us a monotonic trend, as we will see later. The intuition is that the event rate is lower for lower values of x and increases as x increases. This is ensured by our sampling approach of choosing the right mean and standard deviation for the Normal distributions of x_1 and x_0.
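Since the event group is tiny (114 vs 9,886 observations), the raw histograms above make x_1 hard to see. A normalized overlay, sketched below, makes the separation between the two groups clearer:
plt.hist(x_0, bins=50, density=True, alpha=0.5, label='y = 0 (mean ~100)')
plt.hist(x_1, bins=50, density=True, alpha=0.5, label='y = 1 (mean ~130)')
plt.xlabel('x')
plt.legend()
plt.show()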
Now we sort the vector y to ensure that we "stack" x_0 and x_1 against y = 0 and y = 1, respectively. Finally, we convert the numpy arrays into a dataframe.
y = np.sort(y)
print(y)
x = np.concatenate((x_0, x_1), axis=None)
df = pd.DataFrame({'y':y, 'x':x})
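A quick check (a sketch) that the stacking lined up correctly, assuming, as in this run, that sum(y) came out to 114 so that the group sizes match x_1 and x_0:
print(df.groupby('y')['x'].agg(['count', 'mean']))
# expect ~9886 rows with mean ~100 for y = 0, and ~114 rows with mean ~130 for y = 1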
From here on starts our basic WoE code.
####### which feature do we want a WoE for?
col = 'x'
We can "bin" our variable x in two ways: equally proportioned bins using the percentile method, or user-specified bins.
####### create a new dataframe with the binned values of the feature 'col' using the percentile method
####### To avoid dealing with a Pandas Series, create a new df and set its column to the output from pd.qcut
df_bins = pd.DataFrame()
df_bins[col] = pd.qcut(df[col], 5)
print(df_bins[col].value_counts())
Using pd.qcut() we see how the bins are equally proportioned.
We can also use pd.cut() and specify the bins that we want, which need not be equally proportioned. Note that "include_lowest = True" ensures that x.min() is included in the lowest interval. Thus, we see 23.999 as the lowest end point of the lowest bin, so that x = 24 is included.
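Before applying it to our data, here is a tiny toy illustration of include_lowest (a sketch, not using the article's data): without it, the minimum value 24 falls outside the first right-closed interval and becomes NaN.
s = pd.Series([24, 50, 100, 150])
print(pd.cut(s, bins=[24, 95, 115, 188]))                       # 24 becomes NaN
print(pd.cut(s, bins=[24, 95, 115, 188], include_lowest=True))  # 24 lands in (23.999, 95.0]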
####### create a new dataframe with the binned values of the feature 'col' using user-defined bins
####### include_lowest = True ensures that df[col].min() is included in the binned interval.
df_bins = pd.DataFrame()
df_bins[col] = pd.cut(df[col], bins = [24, 95, 115, 188], include_lowest = True)
print(df_bins[col].value_counts())
That is what our binned dataframe for x looks like.
We define a dataframe for calculating WoE:
woe_df = pd.concat([df['y'], df_bins], axis=1)
And we calculate WoE using a one-line piece of code (note the use of the logit function discussed earlier). Please refer to my earlier article for the rationale behind the following script: https://medium.com/towards-artificial-intelligence/how-to-use-target-encoding-in-machine-learning-credit-risk-models-part-1-cf72f30f51a8
woe_x = logit(woe_df.groupby('x')['y'].sum() / woe_df.groupby('x')['y'].count()) - logit(woe_df['y'].sum() / len(woe_df))
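For readability, here is the same calculation broken into steps (a sketch equivalent to the one-liner above): the WoE of a bin is the log-odds of the event rate within the bin minus the log-odds of the overall event rate.
event_rate_per_bin = woe_df.groupby('x')['y'].mean()   # P(y = 1 | bin)
overall_event_rate = woe_df['y'].mean()                # P(y = 1) over the whole sample
print(logit(event_rate_per_bin) - logit(overall_event_rate))   # matches woe_x above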
It is interesting to see that the lowest bin has a WoE of negative infinity. We will see the reason for that later. But first we plot the WoE values for x.
fig, ax = plt.subplots()
woe_x.plot(ax=ax, color = 'r', linewidth = 3)
plt.xlabel('x')
plt.ylabel('WoE')
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()
As expected, we see a nice monotonic trend of WoE values for x. This indicates that as x increases, the probability of the event or outcome y occurring increases as well.
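This is also where expit comes in handy: since each WoE value is the bin's log-odds relative to the overall log-odds, adding the overall log-odds back and applying expit recovers the event probability per bin (a sketch):
print(expit(woe_x + logit(woe_df['y'].mean())))   # per-bin P(y = 1), increasing with x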
The following is also a good alternative method for calculating WoE values. It also lays out the entire "table" of calculations, so it is helpful to look at.
def calculate_woe_iv(dataset, feature, target):
    # Build one row per unique value (bin) of the feature with counts of
    # all observations, non-events ('Good') and events ('Bad')
    lst = []
    for i in range(dataset[feature].nunique()):
        val = list(dataset[feature].unique())[i]
        lst.append({
            'Value': val,
            'All': dataset[dataset[feature] == val].count()[feature],
            'Good': dataset[(dataset[feature] == val) & (target == 0)].count()[feature],
            'Bad': dataset[(dataset[feature] == val) & (target == 1)].count()[feature]
        })
    dset = pd.DataFrame(lst)
    # Distribution of goods and bads across bins, WoE per bin and the IV contribution per bin
    dset['Distr_Good'] = dset['Good'] / dset['Good'].sum()
    dset['Distr_Bad'] = dset['Bad'] / dset['Bad'].sum()
    dset['WoE'] = np.log(dset['Distr_Bad'] / dset['Distr_Good'])
    dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})  # bins with zero goods or bads would give +/- infinity
    dset['IV'] = (dset['Distr_Bad'] - dset['Distr_Good']) * dset['WoE']
    iv = dset['IV'].sum()
    dset = dset.sort_values(by='Value')  # sort by values and not by WoEs to detect non-monotonic trends
    return dset, iv
We can also calculate the information value (IV) for the variable x, which quantifies the predictive power of x in predicting y. We use the same bins as before.
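For reference, the IV returned by the function above is simply the sum of the per-bin contributions in the table, i.e. IV = sum over all bins of (Distr_Bad - Distr_Good) * WoE.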
print('WoE and IV for variable: {}'.format(col))
woe_iv_df, iv = calculate_woe_iv(df_bins, col, woe_df['y'])
print(woe_iv_df)
print('IV rating: {:.2f}'.format(iv))
print('\n')
The above result shows that x is quite predictive, as its IV is 1.75, which is a good value of IV. We also see that the lowest bin has exactly 0 events, so clearly its WoE had to be negative infinity, as we saw earlier. However, this function assigns it a value of 0, which is wrong per se, but acceptable for plotting purposes.
We plot the WoE values once more.
fig,ax = plt.subplots()
woe_iv_df_plt = woe_iv_df[['Value', 'WoE']].set_index('Value')
woe_iv_df_plt.plot(ax=ax, color = 'r', linewidth = 3)
plt.xlabel('x')
plt.ylabel('WoE')
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()
It is the same trend we saw earlier, except that the lowest interval now shows a WoE value of 0. If we want to avoid that, we can use the user-specified bins with pd.cut() instead of pd.qcut(). If we use those, we get the following plot.