libuplift.datasets.IST#

The International Stroke Trial dataset.

This is a randomized clinical trial of heparin and aspirin treatment for stroke patients.

This dataset is derived from the corrected dataset available at https://datashare.ed.ac.uk/handle/10283/128 (the webpage contains detailed descriptions).

This version includes only pre-randomization variables, two main targets, and several additional targets related to side effects.

Functions#

fetch_IST([include_pilot, include_location_vars, ...])

Load the International Stroke Trial (IST) dataset.

Module Contents#

libuplift.datasets.IST.fetch_IST(include_pilot=True, include_location_vars=True, include_prediction_model_vars=True, data_home=None, download_if_missing=True, random_state=None, shuffle=False, categ_as_strings=False, return_X_y=False, as_frame=False)

Load the International Stroke Trial (IST) dataset.

Download it if necessary.

This is a randomized clinical trial of heparin and aspirin treatment for stroke patients.

This dataset is derived from the corrected dataset available at https://datashare.ed.ac.uk/handle/10283/128 (the webpage contains detailed descriptions).

This version includes only pre-randomization variables, two main targets, and several additional targets related to side effects.

The two main targets are:

target_ID14 - death after 14 days
target_OCCODE - outcome after 6 months. The original study used (“dead” or “dependent”) as the outcome of interest.

Additionally there are 9 targets describing side effects at 14 days: target_H14, target_ISC14, target_NK14, target_STRK14, target_HTI14, target_PE14, target_DVT14, target_TRAN14, target_NCB14
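The binary outcome used by the original study (“dead” or “dependent”) can be derived from target_OCCODE. The following is a minimal sketch on toy data; the label strings are assumptions made for illustration, since the exact descriptive values used by this loader are not listed here.

```python
import pandas as pd

# Toy values only. The label strings below are hypothetical; check the
# actual levels of target_OCCODE in the loaded dataset before reusing this.
occode = pd.Series(
    ["dead", "dependent", "not recovered", "recovered", "NA"],
    name="target_OCCODE",
)

# 1.0 for "dead" or "dependent", 0.0 for any other known outcome,
# NaN where the 6-month status is missing.
poor_outcome = (
    occode.isin(["dead", "dependent"])
    .astype(float)
    .mask(occode == "NA")
)
```

Keeping missing statuses as NaN, rather than coding them as 0, avoids silently counting patients with unknown status as good outcomes.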

Variables

See https://datashare.ed.ac.uk/handle/10283/128

Changes to the original dataset

Only pretreatment variables, variables describing outcomes at 14 days, and the 6-month outcome code are included.
All N/Y variables are changed to 0/1.
Level H of RXHEP is recoded as M for pilot study cases.
A variable IS_PILOT is added to indicate pilot study records; it is obtained by testing whether RHEP24 is NaN. The variable is only added if include_pilot is True.
The RDATE variable has been split into RYEAR and RMONTH; month names have been translated to English.
OCCODE is recoded to descriptive values, and the two “missing status” categories are merged into “NA”.
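Two of the changes above, the N/Y recoding and the derivation of IS_PILOT, can be sketched with pandas as follows. This is an illustration on toy data under stated assumptions, not the loader's actual code: the rows are invented, and representing IS_PILOT as 0/1 is an assumption.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration. A missing RHEP24 marks a pilot record.
raw = pd.DataFrame(
    {
        "RSLEEP": ["Y", "N", "Y"],
        "RHEP24": ["N", np.nan, "Y"],
    }
)

recoded = raw.copy()

# Flag pilot records by testing RHEP24 for NaN (0/1 coding is an assumption).
recoded["IS_PILOT"] = raw["RHEP24"].isna().astype(int)

# Map N/Y to 0/1; missing values stay NaN, so affected columns become float.
yn_map = {"N": 0, "Y": 1}
for col in ["RSLEEP", "RHEP24"]:
    recoded[col] = raw[col].map(yn_map)
```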

Parameters:

include_pilot (boolean, default=True): Whether to include records from a pilot study with 984 patients. Some values (RATRIAL and RASP3) are missing in the pilot.
include_location_vars (boolean, default=True): Whether to include variables describing hospitals and their locations. These are categorical variables with a large number of levels: HOSPNUM and COUNTRY.
data_home (string, optional): Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.
download_if_missing (boolean, default=True): If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
random_state (int, RandomState instance or None, default=None): Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls.
shuffle (bool, default=False): Whether to shuffle the dataset.
categ_as_strings (bool, default=False): Whether to return categorical variables as strings.
return_X_y (boolean, default=False): If True, returns (data.data, data.target) instead of a Bunch object.
as_frame (boolean, default=False): If True, features are returned as a pandas DataFrame. If False, features are returned as an object or float array; a float array is returned if all features are floats.

Returns:

dataset (dict-like object) with the following attributes:
dataset.data (numpy array): Each row corresponds to the features in the dataset.
dataset.target (numpy array): Target values, one entry per row of dataset.data.
dataset.DESCR (string): Description of the dataset.
(data, target) (tuple): Returned instead of the dataset object if return_X_y is True.
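A typical call might look like the sketch below. It uses only keyword arguments from the documented signature; the wrapper name load_ist and the chosen argument values are illustrative assumptions, and the import is deferred so that the snippet can be defined without libuplift installed.

```python
# Keyword arguments taken from the documented fetch_IST signature.
# The particular values are illustrative choices, not recommendations.
IST_KWARGS = dict(
    include_pilot=False,          # drop the pilot-study records
    include_location_vars=False,  # drop HOSPNUM and COUNTRY
    shuffle=True,
    random_state=0,               # fixed seed for a reproducible shuffle
    return_X_y=True,              # return (data, target) instead of a Bunch
    as_frame=True,                # features as a pandas DataFrame
)


def load_ist(**overrides):
    """Hypothetical convenience wrapper around fetch_IST."""
    # Imported lazily: requires libuplift, and the first call may
    # download the data unless download_if_missing=False is passed.
    from libuplift.datasets.IST import fetch_IST

    return fetch_IST(**{**IST_KWARGS, **overrides})
```

Calling `X, y = load_ist()` would then return the features and target as a tuple, while `load_ist(return_X_y=False)` would return the dict-like dataset object with its data, target, and DESCR attributes.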
