
Split high-dimensional cytometry data into a training and test set

Usage

tof_split_data(
  feature_tibble,
  split_method = c("k-fold", "bootstrap", "simple"),
  split_col,
  simple_prop = 3/4,
  num_cv_folds = 10,
  num_cv_repeats = 1L,
  num_bootstraps = 10,
  strata = NULL,
  ...
)

Arguments

feature_tibble

A tibble in which each row represents a sample- or patient-level observation, such as those produced by tof_extract_features.

split_method

Either a string or a logical vector specifying how to perform the split. If a string, valid options include k-fold cross validation ("k-fold"; the default), bootstrapping ("bootstrap"), or a single binary split ("simple"). If a logical vector, it should contain one entry for each row in `feature_tibble` indicating if that row should be included in the training set (TRUE) or excluded for the validation/test set (FALSE). Ignored entirely if `split_col` is specified.

split_col

The unquoted column name of the logical column in `feature_tibble` indicating if each row should be included in the training set (TRUE) or excluded for the validation/test set (FALSE).

simple_prop

A numeric value between 0 and 1 indicating what proportion of the data should be used for training. Defaults to 3/4. Ignored if split_method is not "simple".

num_cv_folds

An integer indicating how many cross-validation folds should be used. Defaults to 10. Ignored if split_method is not "k-fold".

num_cv_repeats

An integer indicating how many independent cross-validation replicates should be used (i.e. how many times the num_cv_folds-fold split should be repeated). Defaults to 1. Ignored if split_method is not "k-fold".

num_bootstraps

An integer indicating how many independent bootstrap replicates should be used. Defaults to 10. Ignored if split_method is not "bootstrap".

strata

An unquoted column name representing the column in feature_tibble that should be used to stratify the data splitting. Defaults to NULL (no stratification).

...

Optional additional arguments to pass to vfold_cv for k-fold cross validation, bootstraps for bootstrapping, or initial_split for simple splitting.
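
The split_col argument described above can be illustrated with a small sketch. The column name `is_training` and the 20-row table are hypothetical and chosen only for illustration; the sketch assumes, as described under Value, that a predefined split yields an "rsplit" object that the rsample accessors can read.

```r
library(tidytof)

set.seed(2023)

# Hypothetical feature table; `is_training` is an illustrative column name
feature_tibble <-
    dplyr::tibble(
        sample = as.character(1:20),
        cd45 = runif(n = 20),
        is_training = rep(c(TRUE, FALSE), times = c(15, 5))
    )

# split_col is given as an unquoted column name; split_method is ignored
predefined_split <-
    tof_split_data(
        feature_tibble = feature_tibble,
        split_col = is_training
    )

# Rows flagged TRUE go to the training set, rows flagged FALSE to the test set
training_rows <- nrow(rsample::training(predefined_split))
testing_rows <- nrow(rsample::testing(predefined_split))
```

A predefined split like this may be useful when the training and validation cohorts are fixed in advance (for example, by collection site) rather than drawn at random.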

Value

For k-fold cross validation and bootstrapping, an "rset" object; for simple splitting, an "rsplit" object. For details, see rsample.
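
As a rough sketch of how the returned objects might be used, the rsample accessor functions can pull the underlying rows back out. The small feature table here is made up for illustration; `training()`, `testing()`, `analysis()`, and `assessment()` are rsample functions, not part of tidytof.

```r
library(tidytof)

set.seed(2023)

# Illustrative feature table with 100 sample-level observations
feature_tibble <-
    dplyr::tibble(
        sample = as.character(1:100),
        cd45 = runif(n = 100),
        pstat5 = runif(n = 100)
    )

# A simple split returns a single "rsplit" object
simple_split <-
    tof_split_data(
        feature_tibble = feature_tibble,
        split_method = "simple"
    )
training_data <- rsample::training(simple_split)
testing_data <- rsample::testing(simple_split)

# k-fold cross validation returns an "rset": a tibble with one split per row
cv_folds <-
    tof_split_data(
        feature_tibble = feature_tibble,
        split_method = "k-fold",
        num_cv_folds = 5
    )
first_fold <- cv_folds$splits[[1]]
analysis_data <- rsample::analysis(first_fold)      # rows used for fitting
assessment_data <- rsample::assessment(first_fold)  # held-out rows
```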

See also

Other modeling functions: tof_assess_model(), tof_create_grid(), tof_predict(), tof_train_model()

Examples

feature_tibble <-
    dplyr::tibble(
        sample = as.character(1:100),
        cd45 = runif(n = 100),
        pstat5 = runif(n = 100),
        cd34 = runif(n = 100),
        outcome = (3 * cd45) + (4 * pstat5) + rnorm(100),
        class =
            as.factor(
                dplyr::if_else(outcome > median(outcome), "class1", "class2")
            ),
        multiclass =
            as.factor(
                c(rep("class1", 30), rep("class2", 30), rep("class3", 40))
            ),
        event = c(rep(0, times = 50), rep(1, times = 50)),
        time_to_event = rnorm(n = 100, mean = 10, sd = 2)
    )

# split the dataset into 10 CV folds
tof_split_data(
    feature_tibble = feature_tibble,
    split_method = "k-fold"
)
#> #  10-fold cross-validation 
#> # A tibble: 10 × 2
#>    splits          id    
#>    <list>          <chr> 
#>  1 <split [90/10]> Fold01
#>  2 <split [90/10]> Fold02
#>  3 <split [90/10]> Fold03
#>  4 <split [90/10]> Fold04
#>  5 <split [90/10]> Fold05
#>  6 <split [90/10]> Fold06
#>  7 <split [90/10]> Fold07
#>  8 <split [90/10]> Fold08
#>  9 <split [90/10]> Fold09
#> 10 <split [90/10]> Fold10

# split the dataset into 10 bootstrap resamplings
tof_split_data(
    feature_tibble = feature_tibble,
    split_method = "bootstrap"
)
#> # Bootstrap sampling 
#> # A tibble: 10 × 2
#>    splits           id         
#>    <list>           <chr>      
#>  1 <split [100/36]> Bootstrap01
#>  2 <split [100/39]> Bootstrap02
#>  3 <split [100/34]> Bootstrap03
#>  4 <split [100/39]> Bootstrap04
#>  5 <split [100/31]> Bootstrap05
#>  6 <split [100/35]> Bootstrap06
#>  7 <split [100/32]> Bootstrap07
#>  8 <split [100/39]> Bootstrap08
#>  9 <split [100/36]> Bootstrap09
#> 10 <split [100/38]> Bootstrap10

# split the dataset into a single training/test set
# stratified by the "class" column
tof_split_data(
    feature_tibble = feature_tibble,
    split_method = "simple",
    strata = class
)
#> <Training/Testing/Total>
#> <74/26/100>
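
The examples above use the default fold and repeat counts. A further sketch, assuming num_cv_folds and num_cv_repeats behave as described under Arguments, shows repeated cross validation; the fold and repeat counts are arbitrary and the table is a reduced version of the one above.

```r
library(tidytof)

set.seed(2023)

# Reduced, illustrative version of the feature table used above
feature_tibble <-
    dplyr::tibble(
        sample = as.character(1:100),
        cd45 = runif(n = 100),
        pstat5 = runif(n = 100)
    )

# 5-fold cross validation, repeated 3 times
repeated_folds <-
    tof_split_data(
        feature_tibble = feature_tibble,
        split_method = "k-fold",
        num_cv_folds = 5,
        num_cv_repeats = 3
    )

# One row per fold per repeat
num_splits <- nrow(repeated_folds)
```

Repeating the fold assignment is one way to reduce the variance of a cross-validated performance estimate, at the cost of fitting proportionally more models.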