libuplift.datasets.pbc

The pbc dataset from the R survival package.

Functions

fetch_pbc([data_home, download_if_missing, ...])

Load the pbc dataset from the R survival package (uplift survival).

Module Contents

libuplift.datasets.pbc.fetch_pbc(data_home=None, download_if_missing=True, random_state=None, shuffle=False, categ_as_strings=False, return_X_y=False, as_frame=False)

Load the pbc dataset from the R survival package (uplift survival).

Download it if necessary.

Only the first 312 records, which have an assigned treatment, are kept.

Following the original dataset, the edema variable is numerical but can also be treated as categorical: 0 = no edema, 0.5 = untreated or successfully treated, 1 = edema despite diuretic therapy.
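As an illustration of treating edema as categorical, the following minimal sketch maps the three numeric codes to string labels. The helper name `edema_to_labels`, the `EDEMA_LABELS` dictionary, and the sample values are hypothetical and not part of libuplift; only the three codes and their meanings come from the description above.

```python
import numpy as np

# Codes and meanings follow the dataset description; the dict name is hypothetical.
EDEMA_LABELS = {
    0.0: "no edema",
    0.5: "untreated or successfully treated",
    1.0: "edema despite diuretic therapy",
}


def edema_to_labels(edema):
    """Map numeric edema codes (0, 0.5, 1) to descriptive string labels."""
    values = np.asarray(edema, dtype=float).tolist()
    # Reject anything that is not one of the three documented codes.
    unknown = sorted({v for v in values if v not in EDEMA_LABELS})
    if unknown:
        raise ValueError(f"unexpected edema codes: {unknown}")
    return np.array([EDEMA_LABELS[v] for v in values], dtype=object)


# Synthetic stand-in values, not taken from the real dataset.
sample = [0, 0.5, 1, 0]
labels = edema_to_labels(sample)
print(labels)
```

With real data, the same helper could be applied to the edema column of the returned features. Whether `categ_as_strings=True` also affects this column is not stated here, so that should be checked before relying on it.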

The variables chol, copper, trig, and platelet contain missing data.
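Because these four variables contain missing data, many estimators will need the gaps handled first. The sketch below shows one common option, median imputation with a missingness mask. It assumes missing entries are encoded as NaN, which is not confirmed by this page. The helper `impute_median` and the sample values are hypothetical.

```python
import numpy as np


def impute_median(column):
    """Return (filled column, boolean mask of originally missing entries).

    Assumes missing values are encoded as NaN.
    """
    column = np.asarray(column, dtype=float)
    missing = np.isnan(column)
    filled = column.copy()
    # nanmedian ignores NaN entries when computing the median.
    filled[missing] = np.nanmedian(column)
    return filled, missing


# Synthetic stand-in for a column such as chol; not real dataset values.
chol = np.array([261.0, np.nan, 176.0, 244.0, np.nan])
chol_filled, chol_missing = impute_median(chol)
print(chol_filled, chol_missing)
```

Keeping the mask alongside the filled column lets a downstream model use "was missing" as its own feature if that turns out to be informative.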

Parameters:
data_home : string, optional

Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

download_if_missing : boolean, default=True

If False, raise an IOError if the data is not locally available instead of trying to download it from the source site.

random_state : int, RandomState instance or None, default=None

Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls.

shuffle : bool, default=False

Whether to shuffle the dataset.

categ_as_strings : bool, default=False

Whether to return categorical variables as strings.

return_X_y : boolean, default=False

If True, returns (data, target_time, target_status) instead of a Bunch object.

as_frame : boolean, default=False

If True, features are returned as a pandas DataFrame. If False, features are returned as an object or float array; a float array is returned only if all features are floats.

Returns:
dataset : dict-like object with the following attributes:
dataset.data : numpy array

Each row corresponds to the features in the dataset.

dataset.target_status : numpy array

Censoring status: 0=censored, 1=transplant, 2=dead.

dataset.target_time : numpy array

Censoring, transplant or death time.

dataset.DESCR : string

Description of the dataset.

(data, target_time, target_status) : tuple if return_X_y is True
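Most survival estimators expect a binary event indicator rather than the three-level status documented for dataset.target_status. The sketch below shows two possible encodings. Treating transplant as censored is a modeling assumption, not something this page prescribes. The helper `death_event`, the constant names, and the sample arrays are hypothetical.

```python
import numpy as np

# Status codes as documented for target_status.
CENSORED, TRANSPLANT, DEAD = 0, 1, 2


def death_event(target_status, transplant_as_censored=True):
    """Convert the three-level status into a boolean event indicator.

    transplant_as_censored=True  -> event only when status is "dead"
    transplant_as_censored=False -> event when status is transplant or dead
    """
    status = np.asarray(target_status)
    if transplant_as_censored:
        return status == DEAD
    return status != CENSORED


# Synthetic stand-ins for target_status and target_time; not real values.
status = np.array([0, 1, 2, 2, 0])
time = np.array([400.0, 1500.0, 1012.0, 51.0, 2503.0])

event_death_only = death_event(status)
event_composite = death_event(status, transplant_as_censored=False)
print(event_death_only, event_composite)
```

With the real dataset, `status` and `time` would be the `target_status` and `target_time` values returned by `fetch_pbc(return_X_y=True)`.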