This feature extraction function calculates a user-specified measurement of central tendency (e.g. median or mode) for the cells in each cluster of a `tof_tibble`, across a user-specified selection of CyTOF markers. The calculation can be done either overall (across all cells in the dataset) or after breaking the cells into subgroups using `group_cols`.

Usage

tof_extract_central_tendency(
  tof_tibble,
  cluster_col,
  group_cols = NULL,
  marker_cols = where(tof_is_numeric),
  stimulation_col = NULL,
  central_tendency_function = stats::median,
  format = c("wide", "long")
)

Arguments

tof_tibble

A `tof_tibble` or a `tibble` in which each row represents a single cell and each column represents a CyTOF measurement or a piece of metadata (e.g. cluster id or patient id) about each cell.

cluster_col

An unquoted column name indicating which column in `tof_tibble` stores the id of the cluster to which each cell belongs. Cluster labels can be produced by any method the user chooses, including manual gating or any of the functions in the `tof_cluster_*` function family.

group_cols

Unquoted column names representing which columns in `tof_tibble` should be used to break the rows of `tof_tibble` into subgroups for the feature extraction calculation. Defaults to NULL (i.e. performing the extraction without subgroups).

marker_cols

Unquoted column names representing which columns in `tof_tibble` (i.e. which CyTOF protein measurements) should be included in the feature extraction calculation. Defaults to all numeric (integer or double) columns. Supports tidyselection.

stimulation_col

Optional. An unquoted column name that indicates which column in `tof_tibble` contains information about which stimulation condition each cell was exposed to during data acquisition. If provided, the feature extraction will be further broken down into subgroups by stimulation condition (and features from each stimulation condition will be included as their own features in wide format).

central_tendency_function

The function that will be used to calculate the measurement of central tendency for each marker in each cluster. Defaults to `stats::median`.

format

A string indicating if the data should be returned in "wide" format (the default; each cluster feature is given its own column) or in "long" format (each cluster feature is provided as its own row).
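As a hedged illustration of how the optional arguments combine, the sketch below uses only the arguments listed in the Usage section. The simulated data, the seed, and the choice of `mean` are assumptions made for the example, and no output is shown because the exact values depend on the simulated data.

```r
library(tidytof)

set.seed(2023) # arbitrary seed, chosen only for reproducibility

# simulated data (an assumption for illustration): 300 cells, 3 clusters
sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 300),
        cd38 = rnorm(n = 300),
        cd34 = rnorm(n = 300),
        cluster_id = sample(c("a", "b", "c"), size = 300, replace = TRUE),
        patient = sample(c("kirby", "mario"), size = 300, replace = TRUE),
        stim = sample(c("basal", "stim"), size = 300, replace = TRUE)
    )

# mean (instead of the default median) of two selected markers, long format
mean_long <-
    tof_extract_central_tendency(
        tof_tibble = sim_data,
        cluster_col = cluster_id,
        group_cols = patient,
        marker_cols = c(cd45, cd38),
        central_tendency_function = mean,
        format = "long"
    )

# default median, additionally broken down by stimulation condition
by_stim <-
    tof_extract_central_tendency(
        tof_tibble = sim_data,
        cluster_col = cluster_id,
        group_cols = patient,
        stimulation_col = stim
    )
```

Because `marker_cols` supports tidyselection, a helper such as `dplyr::starts_with("cd")` could be used in place of `c(cd45, cd38)`.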

Value

A tibble.

If format == "wide", the tibble will have one row for each combination of the grouping variables provided in `group_cols`, one column for each grouping variable, and one column for each extracted feature (the central tendency of a given marker in a given cluster). The name of each column containing a cluster feature follows the pattern "{marker_id}@{cluster_id}_ct".

If format == "long", the tibble will have one row for each combination of the grouping variables in `group_cols`, each cluster id (i.e. level) in `cluster_col`, and each marker in `marker_cols`. It will have one column for each grouping variable, one column for the cluster ids, one column (`channel`) for the CyTOF channel names, and one column (`values`) containing the features.
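To make the long format concrete, here is a minimal sketch of an equivalent calculation written directly with dplyr and tidyr. It is not the package's implementation, only an illustration of the result described above; the simulated data, the seed, and the object names (`cells`, `long_features`) are assumptions.

```r
library(dplyr)
library(tidyr)

set.seed(2023) # arbitrary seed, chosen only for reproducibility

# simulated data (an assumption for illustration): 200 cells, 2 clusters
cells <-
    tibble(
        cd45 = rnorm(n = 200),
        cd38 = rnorm(n = 200),
        cluster_id = sample(c("a", "b"), size = 200, replace = TRUE),
        patient = sample(c("kirby", "mario"), size = 200, replace = TRUE)
    )

long_features <-
    cells %>%
    # one group per combination of grouping variable and cluster
    group_by(patient, cluster_id) %>%
    # central tendency (here the median) of each marker within each group
    summarize(across(c(cd45, cd38), stats::median), .groups = "drop") %>%
    # one row per patient, cluster, and marker
    pivot_longer(
        cols = c(cd45, cd38),
        names_to = "channel",
        values_to = "values"
    )
```

The result has one row per combination of patient, cluster, and marker, with the columns `patient`, `cluster_id`, `channel`, and `values`.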

Examples

sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000),
        cluster_id = sample(letters, size = 1000, replace = TRUE),
        patient = sample(c("kirby", "mario"), size = 1000, replace = TRUE),
        stim = sample(c("basal", "stim"), size = 1000, replace = TRUE)
    )

# extract the median of each marker in each cluster for each patient in wide format
tof_extract_central_tendency(
    tof_tibble = sim_data,
    cluster_col = cluster_id,
    group_cols = patient
)
#> # A tibble: 2 × 105
#>   patient `cd45@a_ct` `cd38@a_ct` `cd34@a_ct` `cd19@a_ct` `cd45@b_ct`
#>   <chr>         <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
#> 1 kirby         0.311       0.281      -0.296      -0.259      -0.129
#> 2 mario         0.552       0.210       0.490       0.876      -0.670
#> # ℹ 99 more variables: `cd38@b_ct` <dbl>, `cd34@b_ct` <dbl>, `cd19@b_ct` <dbl>,
#> #   `cd45@c_ct` <dbl>, `cd38@c_ct` <dbl>, `cd34@c_ct` <dbl>, `cd19@c_ct` <dbl>,
#> #   `cd45@d_ct` <dbl>, `cd38@d_ct` <dbl>, `cd34@d_ct` <dbl>, `cd19@d_ct` <dbl>,
#> #   `cd45@e_ct` <dbl>, `cd38@e_ct` <dbl>, `cd34@e_ct` <dbl>, `cd19@e_ct` <dbl>,
#> #   `cd45@f_ct` <dbl>, `cd38@f_ct` <dbl>, `cd34@f_ct` <dbl>, `cd19@f_ct` <dbl>,
#> #   `cd45@g_ct` <dbl>, `cd38@g_ct` <dbl>, `cd34@g_ct` <dbl>, `cd19@g_ct` <dbl>,
#> #   `cd45@h_ct` <dbl>, `cd38@h_ct` <dbl>, `cd34@h_ct` <dbl>, …

# extract the median of each marker in each cluster for each patient in long format
tof_extract_central_tendency(
    tof_tibble = sim_data,
    cluster_col = cluster_id,
    group_cols = patient,
    format = "long"
)
#> # A tibble: 208 × 4
#>    patient cluster_id channel  values
#>    <chr>   <chr>      <chr>     <dbl>
#>  1 kirby   a          cd45     0.311 
#>  2 kirby   a          cd38     0.281 
#>  3 kirby   a          cd34    -0.296 
#>  4 kirby   a          cd19    -0.259 
#>  5 kirby   b          cd45    -0.129 
#>  6 kirby   b          cd38    -0.217 
#>  7 kirby   b          cd34    -0.0910
#>  8 kirby   b          cd19     0.168 
#>  9 kirby   c          cd45    -0.0456
#> 10 kirby   c          cd38     0.0657
#> # ℹ 198 more rows
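The two formats carry the same information. As a rough sketch of how they relate, the long rows for cluster `a` shown above can be reshaped with `tidyr::pivot_wider()` into columns that follow the "{marker_id}@{cluster_id}_ct" naming pattern. This only illustrates the relationship between the formats and is not how the function itself is implemented; the object names are assumptions.

```r
library(tibble)
library(tidyr)

# cluster "a" features for both patients, copied from the printed output above
long_features <-
    tribble(
        ~patient, ~cluster_id, ~channel, ~values,
        "kirby",  "a",         "cd45",    0.311,
        "kirby",  "a",         "cd38",    0.281,
        "kirby",  "a",         "cd34",   -0.296,
        "kirby",  "a",         "cd19",   -0.259,
        "mario",  "a",         "cd45",    0.552,
        "mario",  "a",         "cd38",    0.210,
        "mario",  "a",         "cd34",    0.490,
        "mario",  "a",         "cd19",    0.876
    )

# one row per patient, one column per marker-cluster feature
wide_features <-
    pivot_wider(
        long_features,
        id_cols = patient,
        names_from = c(channel, cluster_id),
        values_from = values,
        names_glue = "{channel}@{cluster_id}_ct"
    )

names(wide_features)
```

The resulting column names should match the first five columns of the wide output above.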