libuplift.datasets.actg320

The actg320 trial data from Hosmer, Lemeshow and May.

Functions

fetch_actg320([data_home, download_if_missing, ...])

Load the actg320 AIDS treatment clinical trial dataset.

Module Contents

libuplift.datasets.actg320.fetch_actg320(data_home=None, download_if_missing=True, random_state=None, shuffle=False, categ_as_strings=False, return_X_y=False, as_frame=False)

Load the actg320 AIDS treatment clinical trial dataset.

Download it if necessary.

This is a randomized clinical trial dataset of various AIDS treatments from [2].

The description of the original study can be found in [1].

The main treatment variable indicates whether treatment includes IDV (Indinavir). The treatment_grp variable contains one of four specific treatments given:

1 = ZDV + 3TC
2 = ZDV + 3TC + IDV
3 = d4T + 3TC
4 = d4T + 3TC + IDV

Treatments 3 and 4 were given in only 3 cases.
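As a small illustration of the coding above, the following sketch maps treatment_grp codes to regimen labels and derives a binary IDV indicator from them. The names TREATMENT_GROUPS and includes_idv are hypothetical helpers introduced here, not part of libuplift, and the group codes in the example are made up rather than taken from the dataset.

```python
# Hypothetical helper, not part of libuplift: decode treatment_grp codes.
TREATMENT_GROUPS = {
    1: "ZDV + 3TC",
    2: "ZDV + 3TC + IDV",
    3: "d4T + 3TC",
    4: "d4T + 3TC + IDV",
}


def includes_idv(treatment_grp):
    """Return 1 if the regimen for this group code contains IDV, else 0."""
    drugs = TREATMENT_GROUPS[treatment_grp].split(" + ")
    return int("IDV" in drugs)


# Toy group codes, made up for illustration (not taken from the dataset).
example_groups = [1, 2, 2, 4, 3]
idv_flags = [includes_idv(g) for g in example_groups]
print(idv_flags)
```

Under this coding, groups 2 and 4 map to the IDV arm and groups 1 and 3 to the non-IDV arm.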

Treatment assignment was stratified on the strat2 variable (CD4 count stratum at screening).

Target variables:

time/censor: time/censoring to occurrence of AIDS or death
time_d/censor_d: time/censoring to occurrence of death
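Time/censoring pairs like these are typically analysed with survival methods. The following is a minimal Kaplan-Meier sketch in plain Python, assuming that an indicator value of 1 means the event was observed and 0 means the observation was censored; this page does not state the coding, so check it against the dataset description before relying on it. The function kaplan_meier and the toy values are illustrative and not part of libuplift.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] for each distinct time at which an event occurred.

    Assumption: events[i] == 1 means the event was observed at times[i],
    events[i] == 0 means the observation was censored at times[i].
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Events observed exactly at time t.
        n_events = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Toy follow-up times and indicators, made up for illustration.
toy_times = [5, 8, 8, 12, 15]
toy_events = [1, 0, 1, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
for t, s in toy_curve:
    print(t, round(s, 3))
```

The same function could be applied to either the time/censor pair or the time_d/censor_d pair, separately for each treatment arm.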

Variables

strat2

CD4 stratum at screening: 0 = CD4 <= 50, 1 = CD4 > 50

sex

1 = Male, 2 = Female

raceth

Race/Ethnicity:
1 = White Non-Hispanic
2 = Black Non-Hispanic
3 = Hispanic (regardless of race)
4 = Asian, Pacific Islander
5 = American Indian, Alaskan Native
6 = Other/unknown

ivdrug

IV drug use history: 1 = Never, 2 = Currently, 3 = Previously

hemophil

Hemophiliac

karnof

Karnofsky Performance Scale

cd4

Baseline CD4 count [Cells/milliliter]

priorzdv

Months of prior ZDV use [months]

age

Age at Enrollment [years]
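The integer codes listed above can be turned into readable labels with a small lookup table. The sketch below is an illustration only: CODEBOOK and decode_record are hypothetical names, not part of libuplift, and the record is made up. The categ_as_strings parameter may offer similar behaviour, but its exact output is not described on this page.

```python
# Hypothetical codebook built from the variable descriptions above.
CODEBOOK = {
    "strat2": {0: "CD4 <= 50", 1: "CD4 > 50"},
    "sex": {1: "Male", 2: "Female"},
    "raceth": {
        1: "White Non-Hispanic",
        2: "Black Non-Hispanic",
        3: "Hispanic (regardless of race)",
        4: "Asian, Pacific Islander",
        5: "American Indian, Alaskan Native",
        6: "Other/unknown",
    },
    "ivdrug": {1: "Never", 2: "Currently", 3: "Previously"},
}


def decode_record(record):
    """Replace coded categorical values with labels; leave other values as-is."""
    return {
        name: CODEBOOK.get(name, {}).get(value, value)
        for name, value in record.items()
    }


# Toy record, made up for illustration (not a row from the dataset).
toy_record = {"strat2": 1, "sex": 2, "raceth": 3, "ivdrug": 1, "age": 40}
decoded = decode_record(toy_record)
print(decoded)
```

Numeric variables such as age, cd4 and priorzdv have no entry in the lookup table and pass through unchanged.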

Parameters:
include_location_vars : boolean, default=True

Whether to include variables describing hospital locations. These are categorical variables with a large number of levels. If False, the variables regl, grpl and grps are removed.

data_home : string, optional

Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

download_if_missing : boolean, default=True

If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.

random_state : int, RandomState instance or None, default=None

Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls.

shuffle : bool, default=False

Whether to shuffle the dataset.

categ_as_strings : bool, default=False

Whether to return categorical variables as strings.

return_X_y : boolean, default=False

If True, returns (data.data, data.target) instead of a Bunch object.

as_frame : boolean, default=False

If True, features are returned as a pandas DataFrame. If False, features are returned as an object or float array; a float array is returned if all features are floats.

Returns:
dataset : dict-like object with the following attributes:
dataset.data : numpy array

Each row contains the features of one observation in the dataset.

dataset.target : numpy array

The target values; see the target variables described above (time/censor and time_d/censor_d).

dataset.DESCR : string

Description of the dataset.

(data, target) : tuple if return_X_y is True
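A minimal usage sketch, based only on the signature and attributes documented above. The wrapper load_actg320 and its seed argument are hypothetical conveniences, not part of libuplift; the import is placed inside the function so that the sketch can be read and imported without the library installed. Calling it requires libuplift and, on first use, network access to download the data.

```python
def load_actg320(seed=0):
    """Fetch the actg320 dataset with shuffled rows and DataFrame features.

    Hypothetical wrapper around fetch_actg320, using only the parameters
    listed in the signature above.
    """
    # Imported lazily: requires libuplift to be installed when called.
    from libuplift.datasets.actg320 import fetch_actg320

    return fetch_actg320(
        download_if_missing=True,  # download the data if not cached locally
        shuffle=True,              # shuffle rows ...
        random_state=seed,         # ... reproducibly across calls
        as_frame=True,             # features as a pandas DataFrame
    )
```

With the library installed, `dataset = load_actg320()` should give access to `dataset.data`, `dataset.target` and `dataset.DESCR` as described in the Returns section.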

References

[1]

S.M. Hammer, et al., “A Controlled Trial of Two Nucleoside Analogues plus Indinavir in Persons with Human Immunodeficiency Virus Infection and CD4 Cell Counts of 200 per Cubic Millimeter or Less”, New England Journal of Medicine, 337(11), 725–733, 1997 (https://www.nejm.org/doi/10.1056/NEJM199709113371101)

[2]

Hosmer, D.W., Lemeshow, S., and May, S., Applied Survival Analysis: Regression Modeling of Time to Event Data, Second Edition, John Wiley and Sons Inc., New York, NY, 2008.