Welcome to libuplift’s documentation!#
Contents:
libuplift#
libuplift is an uplift modeling package based on and integrated with scikit-learn.
Authors: Szymon Jaroszewicz, Krzysztof Rudaś
Design goals#
The design goal of libuplift is to seamlessly integrate with scikit-learn (https://www.scikit-learn.org) and follow its conventions as closely as possible. It is possible to use model evaluation and tuning facilities from scikit-learn either directly or as thin wrappers provided by libuplift.
Features#
A comprehensive collection of datasets for uplift modeling (we believe this is the most complete collection of randomized datasets) * marketing and advertising datasets * medical RTC datasets
Tight integration with scikit-learn: model evaluation routines can be used just as in scikit-learn
Meta-models: T/S/X Learners, transformed target learner
Getting started#
To install libuplift simply use:
pip install libuplift
or to get the latest version install directly from Github:
pip install git+https://github.com/jszymon/libuplift
Let us now build an uplift model on the well known Hillstrom dataset. Begin with the necessary imports:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
Now fetch the dataset and do basic preprocessing:
from libuplift.datasets import fetch_Hillstrom
D = fetch_Hillstrom(as_frame=True)
trt = D.treatment
# encode categorical features, standardize numerical features
ct = ColumnTransformer([("ohe", OneHotEncoder(), list(D.categ_values.keys()))],
remainder=StandardScaler())
X = ct.fit_transform(D.data)
# keep only women's campaign
mask = ~(trt == 1)
X = X[mask]
y = D.target_visit[mask]
trt = (trt[mask] == 2)*1
By libuplift convention, treatments are denoted by successive integers with 0 indicating controls. Additionally the special n_trt argument is passed to all methods to indicate the number of treatments (if n_trt is None it will be inferred automatically, but this may be unreliable and is discouraged).
Now, we’re ready to fit an uplift model (TLearner in our case):
X_train, X_test, y_train, y_test, trt_train, trt_test = train_test_split(X, y, trt, train_size=0.7)
m = TLearnerUpliftClassifier(base_estimator=LogisticRegression())
m.fit(X_train, y_train, trt_train, n_trt=1)
and draw an uplift curve:
import matplotlib.pyplot as plt
from libuplift.metrics import uplift_curve, area_under_uplift_curve
score = m.predict(X_test)[:,1]
print("AUUC=", area_under_uplift_curve(y_test, score, trt_test, n_trt=1))
cx, cy = uplift_curve(y_test, score, trt_test, n_trt=1)
plt.plot(cx, cy)
plt.plot([0,1], [0,cy[-1]], "k-")
plt.show()
An uplift curve#
One can use cross_val_score and GridSearchCV to easily evaluate models or tune their parameters, just as one does in scikit-learn. The functions provided by libuplift are thin wrappers of original scikit-lelearn functions so they behave exactly the same as they would for standard classifiers.
# import those from libuplift instead of sklearn
from libuplift.model_selection import cross_val_score
from libuplift.model_selection import GridSearchCV
m1 = TLearnerUpliftClassifier(base_estimator=LogisticRegression())
m_cv1 = GridSearchCV(m1,
{"base_estimator__C":[1e-1,1,1e1,1e2,1e3]},
cv=3, n_jobs=-1)
# tune regularization of treatment/control models separately
m2 = TLearnerUpliftClassifier(base_estimator=[("model_c", LogisticRegression()),
("model_t", LogisticRegression())])
m_cv2 = GridSearchCV(m2,
{"model_c__C":[1e-1,1,1e1,1e2,1e3],
"model_t__C":[1e-1,1,1e1,1e2,1e3]},
cv=3, n_jobs=-1)
Now evaluate both models using crossvalidated Area Under Uplift Curve:
auuc_m1 = np.mean(cross_val_score(m_cv1, X, y, trt, n_trt=1, cv=5, scoring="auuc"))
auuc_m2 = np.mean(cross_val_score(m_cv2, X, y, trt, n_trt=1, cv=5, scoring="auuc"))
print("crossval AUUC m1:", auuc_m1)
print("crossval AUUC m2:", auuc_m2)
Finally, do a permutation test and draw a learning curve. Again the functions below are thin wrappers of original scikit-learn functions so they accept the same set of parameters.
from libuplift.model_selection import permutation_test_score, learning_curve
score, permutation_scores, pv =\\
permutation_test_score(m, X, y, trt, n_trt=1, cv=3,
n_permutations=100, scoring="auuc",
verbose=10, n_jobs=-1)
fix, (ax0, ax1) = plt.subplots(ncols=2)
ax0.hist(permutation_scores, density=True, label=f"p-value={pv}")
ax0.axvline(score, color="r")
ax0.set_title("Permutation test")
train_sizes, train_scores, test_scores = learning_curve(m, X, y, trt, n_trt=1, scoring="auuc")
train_scores_mean = train_scores.mean(axis=1)
train_scores_std = train_scores.std(axis=1)
test_scores_mean = test_scores.mean(axis=1)
test_scores_std = test_scores.std(axis=1)
ax1.fill_between(train_sizes,
train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std,
alpha=0.1, color='r')
ax1.plot(train_sizes, train_scores_mean, 'ro-', label="Train score")
ax1.fill_between(train_sizes,
test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std,
alpha=0.1, color='g')
ax1.plot(train_sizes, test_scores_mean, 'go-', label="Test score")
ax1.legend()
ax1.yaxis.tick_right()
ax1.set_title("Learning curve")
plt.show()
Model evaluation plots: Permutation test (left) and Learning curve (right).#
We can see that the model is significantly better than random guessing and optimal performance seems to be achieved already with 10000 training records.