Use cases and code to explore the new class that helps tune decision thresholds in scikit-learn
The 1.5 release of scikit-learn includes a new class, TunedThresholdClassifierCV, which makes it easier to optimize the decision thresholds of scikit-learn classifiers. A decision threshold is a cut-off point that converts the predicted probabilities output by a machine learning model into discrete classes. The default decision threshold of the .predict() method of scikit-learn classifiers in a binary classification setting is 0.5. Although this is a sensible default, it is rarely the best choice for classification tasks.
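To make that default concrete, here is a minimal sketch with a toy dataset and classifier (not part of this post's examples) showing that .predict() is equivalent to thresholding the positive-class probability at 0.5:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
clf = LogisticRegression().fit(X, y)

# .predict() applies the default 0.5 cut-off to the positive-class probability
# (ties at exactly 0.5 are practically impossible with continuous features)
proba_pos = clf.predict_proba(X)[:, 1]
assert np.array_equal(clf.predict(X), (proba_pos >= 0.5).astype(int))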
This post introduces the TunedThresholdClassifierCV class and demonstrates how it can optimize decision thresholds for various binary classification tasks. This new class helps bridge the gap between the data scientists who build models and the business stakeholders who make decisions based on the models' output. By fine-tuning decision thresholds, data scientists can improve model performance and better align it with business objectives.
This post covers the following situations where tuning decision thresholds is beneficial:
- Maximizing a metric: Use this when choosing a threshold that maximizes a scoring metric, like the F1 score.
- Cost-sensitive learning: Adjust the threshold when the cost of a false positive is not equal to the cost of a false negative, and you have an estimate of those costs.
- Tuning under constraints: Optimize the operating point on the ROC or precision-recall curve to meet specific performance constraints.
The code used in this post and links to the datasets are available on GitHub.
Let's get started! First, import the necessary libraries, read the data, and split it into training and test sets.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
RocCurveDisplay,
f1_score,
make_scorer,
recall_score,
roc_curve,
confusion_matrix,
)
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

RANDOM_STATE = 26120
Maximizing a metric
Before starting the model-building process in any machine learning project, it is crucial to work with stakeholders to determine which metric(s) to optimize. Making this decision early ensures that the project aligns with its intended goals.
Using accuracy to evaluate model performance in fraud detection use cases is not ideal because the data is often imbalanced, with most transactions being non-fraudulent. The F1 score, the harmonic mean of precision and recall (F1 = 2 · precision · recall / (precision + recall)), is a better metric for imbalanced datasets like fraud detection. Let's use the TunedThresholdClassifierCV class to optimize the decision threshold of a logistic regression model to maximize the F1 score.
We'll use the Kaggle Credit Card Fraud Detection dataset to introduce the first scenario where we need to tune a decision threshold. First, split the data into train and test sets, then create a scikit-learn pipeline to scale the data and train a logistic regression model. Fit the pipeline on the training data so we can compare the original model's performance with the tuned model's performance.
creditcard = pd.read_csv("data/creditcard.csv")
y = creditcard["Class"]
X = creditcard.drop(columns=["Class"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)
# Only Time and Amount need to be scaled
original_fraud_model = make_pipeline(
    ColumnTransformer(
        [("scaler", StandardScaler(), ["Time", "Amount"])],
        remainder="passthrough",
        force_int_remainder_cols=False,
    ),
    LogisticRegression(),
)
original_fraud_model.fit(X_train, y_train)
No tuning has happened yet, but it's coming in the next code block. The arguments for TunedThresholdClassifierCV are similar to those of other CV classes in scikit-learn, such as GridSearchCV. At a minimum, the user only needs to pass the original estimator, and TunedThresholdClassifierCV will store the decision threshold that maximizes balanced accuracy (the default metric) using 5-fold stratified cross-validation (the default strategy). It also uses this threshold when .predict() is called. However, any scikit-learn metric (or callable) can be used as the scoring metric, and the user can pass the familiar cv argument to customize the cross-validation strategy.
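As a minimal sketch of those defaults (the estimator reuses the fraud pipeline defined above; the custom cv shown is purely illustrative):

from sklearn.model_selection import StratifiedKFold, TunedThresholdClassifierCV

# All defaults: maximizes balanced accuracy with 5-fold stratified cross-validation
tuned_default = TunedThresholdClassifierCV(original_fraud_model)

# Any metric and CV strategy can be passed, as with GridSearchCV
tuned_custom = TunedThresholdClassifierCV(
    original_fraud_model,
    scoring="f1",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=RANDOM_STATE),
)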
Create the TunedThresholdClassifierCV instance and fit it on the training data: pass the original model and set scoring to "f1". We'll also set store_cv_results=True to access the thresholds evaluated during cross-validation for visualization.
tuned_fraud_model = TunedThresholdClassifierCV(
    original_fraud_model,
    scoring="f1",
    store_cv_results=True,
)

tuned_fraud_model.fit(X_train, y_train)
# Average F1 across folds
avg_f1_train = tuned_fraud_model.best_score_

# Compare F1 on the test set for the tuned model and the original model
f1_test = f1_score(y_test, tuned_fraud_model.predict(X_test))
f1_test_original = f1_score(y_test, original_fraud_model.predict(X_test))

print(f"Average F1 on the training set: {avg_f1_train:.3f}")
print(f"F1 on the test set: {f1_test:.3f}")
print(f"F1 on the test set (original model): {f1_test_original:.3f}")
print(f"Threshold: {tuned_fraud_model.best_threshold_:.3f}")
Average F1 on the training set: 0.784
F1 on the test set: 0.796
F1 on the test set (original model): 0.733
Threshold: 0.071
Now that we've found the threshold that maximizes the F1 score, inspect tuned_fraud_model.best_score_ to see the best average F1 score across folds during cross-validation, and tuned_fraud_model.best_threshold_ to see which threshold produced it. You can visualize the scores across the candidate decision thresholds evaluated during cross-validation using the cv_results_ attribute:
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
    tuned_fraud_model.cv_results_["thresholds"],
    tuned_fraud_model.cv_results_["scores"],
    marker="o",
    linewidth=1e-3,
    markersize=4,
    color="#c0c0c0",
)
ax.plot(
    tuned_fraud_model.best_threshold_,
    tuned_fraud_model.best_score_,
    "^",
    markersize=10,
    color="#ff6700",
    label=f"Optimal cut-off point = {tuned_fraud_model.best_threshold_:.2f}",
)
ax.plot(
    0.5,
    f1_test_original,
    label="Default threshold: 0.5",
    color="#004e98",
    linestyle="--",
    marker="X",
    markersize=10,
)
ax.legend(fontsize=8, loc="lower center")
ax.set_xlabel("Decision threshold", fontsize=10)
ax.set_ylabel("F1 score", fontsize=10)
ax.set_title("F1 score vs. Decision threshold -- Cross-validation", fontsize=12)
# Check that the coefficients of the original model and the tuned model are identical
assert (tuned_fraud_model.estimator_[-1].coef_ ==
        original_fraud_model[-1].coef_).all()
We've used the same underlying logistic regression model to evaluate two different decision thresholds; the models are identical, as the coefficient equality in the assert statement above shows. Optimization in TunedThresholdClassifierCV is achieved with post-processing techniques applied directly to the predicted probabilities output by the model. Note, however, that TunedThresholdClassifierCV uses cross-validation by default to find the decision threshold, to avoid overfitting to the training data.
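Because the tuning is pure post-processing, you can reproduce the tuned model's predictions by thresholding the fitted pipeline's probabilities yourself. A quick sanity check (it assumes no predicted probability lands exactly on the threshold, where tie-breaking details could differ):

# tuned_fraud_model.estimator_ is the underlying pipeline, refit on the full training set
proba_pos = tuned_fraud_model.estimator_.predict_proba(X_test)[:, 1]
manual_pred = (proba_pos >= tuned_fraud_model.best_threshold_).astype(int)
assert np.array_equal(manual_pred, tuned_fraud_model.predict(X_test))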
Cost-sensitive learning
Cost-sensitive learning is a type of machine learning that assigns a cost to each type of misclassification. This translates model performance into units that stakeholders understand, like dollars saved.
We will use the TELCO customer churn dataset, a binary classification dataset, to demonstrate the value of cost-sensitive learning. The goal is to predict whether a customer will churn, given features about the customer's demographics, contract details, and other technical information about the customer's account. The motivation for using this dataset (and some of the code) comes from Dan Becker's course on decision threshold optimization.
data = pd.read_excel("data/Telco_customer_churn.xlsx")
drop_cols = [
    "Count", "Country", "State", "Lat Long", "Latitude", "Longitude",
    "Zip Code", "Churn Value", "Churn Score", "CLTV", "Churn Reason"
]
data.drop(columns=drop_cols, inplace=True)

# Preprocess the data
data["Churn Label"] = data["Churn Label"].map({"Yes": 1, "No": 0})
data.drop(columns=["Total Charges"], inplace=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=["Churn Label"]),
    data["Churn Label"],
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=data["Churn Label"],
)
Set up a basic pipeline for processing the data and generating predicted probabilities with a random forest model. This will serve as a baseline to compare against the TunedThresholdClassifierCV.
preprocessor = ColumnTransformer(
    transformers=[("one_hot", OneHotEncoder(),
                   selector(dtype_include="object"))],
    remainder="passthrough",
)

original_churn_model = make_pipeline(
    preprocessor, RandomForestClassifier(random_state=RANDOM_STATE)
)
original_churn_model.fit(X_train.drop(columns=["CustomerID"]), y_train);
The choice of preprocessing and model type is not important for this tutorial. The company wants to offer discounts to customers who are predicted to churn. During collaboration with stakeholders, you learn that giving a discount to a customer who will not churn (a false positive) costs $80, and that offering a discount to a customer who would have churned (a true positive) is worth $200. You can represent this relationship in a cost matrix:
def cost_function(y, y_pred, neg_label, pos_label):
    cm = confusion_matrix(y, y_pred, labels=[neg_label, pos_label])
    # Rows are true labels, columns are predicted labels:
    # a false positive costs $80; a true positive is worth $200
    cost_matrix = np.array([[0, -80], [0, 200]])
    return np.sum(cm * cost_matrix)

cost_scorer = make_scorer(cost_function, neg_label=0, pos_label=1)
We also wrapped the cost function in a custom scikit-learn scorer. This scorer will be used as the scoring argument in TunedThresholdClassifierCV and to evaluate profit on the test set.
tuned_churn_model = TunedThresholdClassifierCV(
    original_churn_model,
    scoring=cost_scorer,
    store_cv_results=True,
)

tuned_churn_model.fit(X_train.drop(columns=["CustomerID"]), y_train)
# Calculate the profit on the test set
original_model_profit = cost_scorer(
    original_churn_model, X_test.drop(columns=["CustomerID"]), y_test
)
tuned_model_profit = cost_scorer(
    tuned_churn_model, X_test.drop(columns=["CustomerID"]), y_test
)
print(f"Original model profit: {original_model_profit}")
print(f"Tuned model profit: {tuned_model_profit}")
Original model profit: 29640
Tuned model profit: 35600
The profit is higher with the tuned model than with the original. Again, we can plot the objective metric against the decision thresholds to visualize how the decision threshold was selected on the training data during cross-validation:
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(
    tuned_churn_model.cv_results_["thresholds"],
    tuned_churn_model.cv_results_["scores"],
    marker="o",
    markersize=3,
    linewidth=1e-3,
    color="#c0c0c0",
    label="Objective score (using cost-matrix)",
)
ax.plot(
    tuned_churn_model.best_threshold_,
    tuned_churn_model.best_score_,
    "^",
    markersize=10,
    color="#ff6700",
    label="Optimal cut-off point for the business metric",
)
ax.legend()
ax.set_xlabel("Decision threshold (probability)")
ax.set_ylabel("Objective score (using cost-matrix)")
ax.set_title("Objective score as a function of the decision threshold")
In reality, assigning a static cost to all instances that are misclassified in the same way is not realistic from a business perspective. More advanced methods tune the threshold by assigning a weight to each instance in the dataset; this is covered in scikit-learn's cost-sensitive learning example, and a sketch follows below.
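As a hedged sketch of what per-instance costs could look like on this dataset, the snippet below scales the cost and gain by each customer's "Monthly Charges" and routes that column to the scorer with scikit-learn's metadata routing, following the pattern in the cost-sensitive learning example. The gain multipliers and the use of this column are illustrative assumptions, not figures from the stakeholders:

import sklearn

# Hypothetical per-customer gains: retaining a churner is assumed to be worth
# 12x their monthly charge; a wasted discount is assumed to cost 1x
def per_instance_gain(y_true, y_pred, monthly_charges):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    true_pos = (y_true == 1) & (y_pred == 1)
    false_pos = (y_true == 0) & (y_pred == 1)
    return 12 * monthly_charges[true_pos].sum() - monthly_charges[false_pos].sum()

# Enable metadata routing so the scorer can request the per-row amounts
sklearn.set_config(enable_metadata_routing=True)
per_instance_scorer = make_scorer(per_instance_gain).set_score_request(
    monthly_charges=True
)
weighted_churn_model = TunedThresholdClassifierCV(
    original_churn_model, scoring=per_instance_scorer
)
weighted_churn_model.fit(
    X_train.drop(columns=["CustomerID"]), y_train,
    monthly_charges=X_train["Monthly Charges"].to_numpy(),
)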
Tuning under constraints
This method is not currently covered in the scikit-learn documentation, but it is a common business case for binary classification. The tuning-under-constraint method finds a decision threshold by identifying a point on either the ROC or the precision-recall curve: the point that maximizes the value on one axis while constraining the other axis. For this walkthrough, we'll use the Pima Indians diabetes dataset, a binary classification task to predict whether an individual has diabetes.
Imagine that your model will be used as a screening test for an average-risk population, applied to millions of people. There are an estimated 38 million people with diabetes in the US, roughly 11.6% of the population, so the model's specificity needs to be high to avoid misdiagnosing millions of people with diabetes and referring them for unnecessary confirmatory testing. Suppose your imaginary CEO has communicated that they will not tolerate more than a 2% false positive rate. Let's build a model that achieves this using TunedThresholdClassifierCV.
For this part of the tutorial, we'll define a constraint function that finds the maximum true positive rate subject to a 2% false positive rate (that is, a true negative rate of at least 98%).
def max_tpr_at_tnr_constraint_score(y_true, y_pred, min_tnr=0.5):
    fpr, tpr, thresholds = roc_curve(y_true, y_pred, drop_intermediate=False)
    tnr = 1 - fpr
    # Best TPR among operating points that satisfy the TNR constraint
    tpr_at_tnr_constraint = tpr[tnr >= min_tnr].max()
    return tpr_at_tnr_constraint

max_tpr_at_tnr_scorer = make_scorer(
    max_tpr_at_tnr_constraint_score, min_tnr=0.98)
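It may not be obvious why an ROC-based score works inside TunedThresholdClassifierCV: the scorer receives hard 0/1 predictions at each candidate threshold, so the ROC curve collapses to a single operating point, and the function returns the achieved true positive rate when the constraint is met and 0 otherwise. A quick illustration on made-up labels:

# 98 negatives followed by 2 positives
y_true_demo = np.array([0] * 98 + [1] * 2)

# One false positive (FPR ~1%, within budget) and one true positive
y_pred_ok = np.zeros_like(y_true_demo)
y_pred_ok[[0, 98]] = 1
print(max_tpr_at_tnr_constraint_score(y_true_demo, y_pred_ok, min_tnr=0.98))  # 0.5

# Three false positives (FPR ~3%) violate the constraint
y_pred_bad = np.zeros_like(y_true_demo)
y_pred_bad[[0, 1, 2, 98]] = 1
print(max_tpr_at_tnr_constraint_score(y_true_demo, y_pred_bad, min_tnr=0.98))  # 0.0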
data = pd.read_csv("data/diabetes.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=["Outcome"]),
    data["Outcome"],
    stratify=data["Outcome"],
    test_size=0.2,
    random_state=RANDOM_STATE,
)
Build two models: a logistic regression to serve as a baseline, and a TunedThresholdClassifierCV that wraps the baseline logistic regression to achieve the goal set by the CEO. In the tuned model, set scoring=max_tpr_at_tnr_scorer. Again, the choice of model and preprocessing is not important for this tutorial.
# A baseline model
original_model = make_pipeline(
    StandardScaler(), LogisticRegression(random_state=RANDOM_STATE)
)
original_model.fit(X_train, y_train)

# A tuned model
tuned_model = TunedThresholdClassifierCV(
    original_model,
    thresholds=np.linspace(0, 1, 150),
    scoring=max_tpr_at_tnr_scorer,
    store_cv_results=True,
    cv=8,
    random_state=RANDOM_STATE,
)
tuned_model.fit(X_train, y_train)
Compare the default decision threshold from scikit-learn estimators, 0.5, with the one found using the tuning-under-constraint approach, on the ROC curve.
# Get the fpr and tpr of the original model
original_model_proba = original_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, original_model_proba)
closest_threshold_to_05 = (np.abs(thresholds - 0.5)).argmin()
fpr_orig = fpr[closest_threshold_to_05]
tpr_orig = tpr[closest_threshold_to_05]

# Get the tnr and tpr of the tuned model
max_tpr = tuned_model.best_score_
constrained_tnr = 0.98
# Plot the ROC curve and compare the default threshold to the tuned threshold
fig, ax = plt.subplots(figsize=(5, 5))
# Note that this will be the same for both models
disp = RocCurveDisplay.from_estimator(
    original_model,
    X_test,
    y_test,
    name="Logistic Regression",
    color="#c0c0c0",
    linewidth=2,
    ax=ax,
)
disp.ax_.plot(
    1 - constrained_tnr,
    max_tpr,
    label=f"Tuned threshold: {tuned_model.best_threshold_:.2f}",
    color="#ff6700",
    linestyle="--",
    marker="o",
    markersize=11,
)
disp.ax_.plot(
    fpr_orig,
    tpr_orig,
    label="Default threshold: 0.5",
    color="#004e98",
    linestyle="--",
    marker="X",
    markersize=11,
)
disp.ax_.set_ylabel("True Positive Rate", fontsize=8)
disp.ax_.set_xlabel("False Positive Rate", fontsize=8)
disp.ax_.tick_params(labelsize=8)
disp.ax_.legend(fontsize=7)
The tuning-under-constraint method found a threshold of 0.80, which resulted in an average sensitivity of 19.2% during cross-validation on the training data. Compare the sensitivity and specificity to see how the threshold holds up on the test set. Did the model meet the CEO's specificity requirement on the test set?
# Average sensitivity and specificity on the training set
avg_sensitivity_train = tuned_model.best_score_

# Call predict from tuned_model to calculate sensitivity and specificity on the test set
specificity_test = recall_score(
    y_test, tuned_model.predict(X_test), pos_label=0)
sensitivity_test = recall_score(y_test, tuned_model.predict(X_test))
print(f"Average sensitivity on the training set: {avg_sensitivity_train:.3f}")
print(f"Sensitivity on the test set: {sensitivity_test:.3f}")
print(f"Specificity on the test set: {specificity_test:.3f}")
Average sensitivity on the training set: 0.192
Sensitivity on the test set: 0.148
Specificity on the test set: 0.990
With a test-set specificity of 0.990, the false positive rate is 1.0%, so the model meets the CEO's 2% requirement.
Conclusion
The new TunedThresholdClassifierCV class is a powerful tool that can help you become a better data scientist by sharing with business leaders how you arrived at a decision threshold. You learned how to use the new scikit-learn TunedThresholdClassifierCV class to maximize a metric, perform cost-sensitive learning, and tune a metric under constraint. This tutorial was not meant to be comprehensive or advanced; I wanted to introduce the new feature and highlight its power and flexibility in solving binary classification problems. Please check out the scikit-learn documentation, user guide, and examples for thorough usage examples.
A big shoutout to Guillaume Lemaitre for his work on this feature.
Thank you for reading. Happy tuning.
Data Licenses:
Credit card fraud: DbCL
Pima Indians diabetes: CC0
TELCO churn: commercial use OK