Split high-dimensional cytometry data into a training and test set
Source: R/patient-level_modeling.R, tof_split_data.Rd
Usage
tof_split_data(
  feature_tibble,
  split_method = c("k-fold", "bootstrap", "simple"),
  split_col,
  simple_prop = 3/4,
  num_cv_folds = 10,
  num_cv_repeats = 1L,
  num_bootstraps = 10,
  strata = NULL,
  ...
)
Arguments
- feature_tibble
A tibble in which each row represents a sample- or patient-level observation, such as those produced by `tof_extract_features`.
- split_method
Either a string or a logical vector specifying how to perform the split. If a string, valid options include k-fold cross validation ("k-fold"; the default), bootstrapping ("bootstrap"), or a single binary split ("simple"). If a logical vector, it should contain one entry for each row in `feature_tibble` indicating if that row should be included in the training set (TRUE) or excluded for the validation/test set (FALSE). Ignored entirely if `split_col` is specified.
- split_col
The unquoted column name of the logical column in `feature_tibble` indicating if each row should be included in the training set (TRUE) or excluded for the validation/test set (FALSE).
- simple_prop
A numeric value between 0 and 1 indicating what proportion of the data should be used for training. Defaults to 3/4. Ignored if split_method is not "simple".
- num_cv_folds
An integer indicating how many cross-validation folds should be used. Defaults to 10. Ignored if split_method is not "k-fold".
- num_cv_repeats
An integer indicating how many independent cross-validation replicates should be used (i.e. how many times the num_cv_folds splits should be performed). Defaults to 1. Ignored if split_method is not "k-fold".
- num_bootstraps
An integer indicating how many independent bootstrap replicates should be used. Defaults to 10. Ignored if split_method is not "bootstrap".
- strata
An unquoted column name representing the column in `feature_tibble` that should be used to stratify the data splitting. Defaults to NULL (no stratification).
- ...
Optional additional arguments to pass to `vfold_cv` for k-fold cross validation, `bootstraps` for bootstrapping, or `initial_split` for simple splitting.
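The three split methods forward to the rsample functions named above. The following is a minimal sketch of roughly equivalent direct rsample calls, not the function's actual implementation. The `toy` data frame is made up for illustration, and the mapping of `num_cv_folds`, `num_cv_repeats`, `num_bootstraps`, and `simple_prop` onto the rsample arguments `v`, `repeats`, `times`, and `prop` is an assumption.

```r
library(rsample)

set.seed(2023)

# Hypothetical sample-level data: one row per sample
toy <- data.frame(
  sample = as.character(1:100),
  cd45 = runif(100),
  class = factor(rep(c("class1", "class2"), each = 50))
)

# k-fold cross validation (assumed analogue of split_method = "k-fold")
cv_splits <- vfold_cv(toy, v = 10, repeats = 1, strata = class)

# bootstrapping (assumed analogue of split_method = "bootstrap")
boot_splits <- bootstraps(toy, times = 10)

# single training/test split (assumed analogue of split_method = "simple")
simple_split <- initial_split(toy, prop = 3/4, strata = class)
```

Any other argument these rsample functions accept would be a candidate for passing through `...`.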
Value
For k-fold cross validation and bootstrapping, an "rset" object; for simple splitting, an "rsplit" object. For details, see the rsample package.
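As a rough illustration of working with the two return types, the sketch below pulls the training and held-out rows out of an "rsplit" and out of one element of an "rset" using rsample's accessor functions. It builds the objects with rsample directly on a made-up `toy` data frame, so it only assumes that the objects returned by `tof_split_data` behave like ordinary rsample objects.

```r
library(rsample)

set.seed(1)

# Hypothetical data for illustration only
toy <- data.frame(id = 1:20, x = rnorm(20))

# An "rsplit" object: a single training/test partition
split <- initial_split(toy, prop = 3/4)
train <- training(split)  # rows assigned to the training set
test <- testing(split)    # rows held out for testing

# An "rset" object: a tibble with one "rsplit" per row in its `splits` column
folds <- vfold_cv(toy, v = 5)
first_fold <- folds$splits[[1]]
fold_train <- analysis(first_fold)    # rows used for fitting in this fold
fold_test <- assessment(first_fold)   # rows held out in this fold
```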
See also
Other modeling functions: tof_assess_model(), tof_create_grid(), tof_predict(), tof_train_model()
Examples
feature_tibble <-
  dplyr::tibble(
    sample = as.character(1:100),
    cd45 = runif(n = 100),
    pstat5 = runif(n = 100),
    cd34 = runif(n = 100),
    outcome = (3 * cd45) + (4 * pstat5) + rnorm(100),
    class =
      as.factor(
        dplyr::if_else(outcome > median(outcome), "class1", "class2")
      ),
    multiclass =
      as.factor(
        c(rep("class1", 30), rep("class2", 30), rep("class3", 40))
      ),
    event = c(rep(0, times = 50), rep(1, times = 50)),
    time_to_event = rnorm(n = 100, mean = 10, sd = 2)
  )
# split the dataset into 10 CV folds
tof_split_data(
  feature_tibble = feature_tibble,
  split_method = "k-fold"
)
#> # 10-fold cross-validation
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [90/10]> Fold01
#> 2 <split [90/10]> Fold02
#> 3 <split [90/10]> Fold03
#> 4 <split [90/10]> Fold04
#> 5 <split [90/10]> Fold05
#> 6 <split [90/10]> Fold06
#> 7 <split [90/10]> Fold07
#> 8 <split [90/10]> Fold08
#> 9 <split [90/10]> Fold09
#> 10 <split [90/10]> Fold10
# split the dataset into 10 bootstrap resamplings
tof_split_data(
  feature_tibble = feature_tibble,
  split_method = "bootstrap"
)
#> # Bootstrap sampling
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [100/36]> Bootstrap01
#> 2 <split [100/39]> Bootstrap02
#> 3 <split [100/34]> Bootstrap03
#> 4 <split [100/39]> Bootstrap04
#> 5 <split [100/31]> Bootstrap05
#> 6 <split [100/35]> Bootstrap06
#> 7 <split [100/32]> Bootstrap07
#> 8 <split [100/39]> Bootstrap08
#> 9 <split [100/36]> Bootstrap09
#> 10 <split [100/38]> Bootstrap10
# split the dataset into a single training/test set
# stratified by the "class" column
tof_split_data(
  feature_tibble = feature_tibble,
  split_method = "simple",
  strata = class
)
#> <Training/Testing/Total>
#> <74/26/100>
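The examples above cover the three string options. The Arguments section also describes a predefined split, given either as a logical column (`split_col`) or as a logical vector passed to `split_method`. The following is a minimal sketch of both forms, assuming tidytof and dplyr are installed. The `is_train` column is a hypothetical name introduced here, and no output is shown because it was not run against a specific tidytof version.

```r
library(dplyr)
library(tidytof)

set.seed(2023)

feature_tibble <-
  tibble(
    sample = as.character(1:100),
    cd45 = runif(n = 100),
    pstat5 = runif(n = 100),
    # hypothetical column: TRUE = training row, FALSE = validation/test row
    is_train = rep(c(TRUE, FALSE), times = c(75, 25))
  )

# predefined split taken from a logical column (unquoted column name)
manual_split <-
  tof_split_data(
    feature_tibble = feature_tibble,
    split_col = is_train
  )

# the same split supplied as a logical vector with one entry per row
manual_split_vec <-
  tof_split_data(
    feature_tibble = feature_tibble,
    split_method = feature_tibble$is_train
  )
```

A predefined split like this is useful when the training and test samples are fixed in advance, for example when they come from separate cohorts.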