This function wraps other members of the `tof_extract_*` function family to extract sample-level features from both lineage (i.e. cell surface antigen) CyTOF channels assumed to be stable across stimulation conditions and signaling CyTOF channels assumed to change across stimulation conditions. Features are extracted for each cluster within each independent sample (as defined with the `group_cols` argument).

Usage

tof_extract_features(
  tof_tibble,
  cluster_col,
  group_cols = NULL,
  stimulation_col = NULL,
  lineage_cols,
  signaling_cols,
  central_tendency_function = stats::median,
  signaling_method = c("threshold", "emd", "jsd", "central tendency"),
  basal_level = NULL,
  ...
)

Arguments

tof_tibble

A `tof_tbl` or a `tibble`.

cluster_col

An unquoted column name indicating which column in `tof_tibble` stores the cluster ids of the cluster to which each cell belongs. Cluster labels can be produced via any method the user chooses - including manual gating, any of the functions in the `tof_cluster_*` function family, or any other method.

group_cols

Unquoted column names representing which columns in `tof_tibble` should be used to break the rows of `tof_tibble` into subgroups for the feature extraction calculation. Defaults to NULL (i.e. performing the extraction without subgroups).

stimulation_col

Optional. An unquoted column name that indicates which column in `tof_tibble` contains information about which stimulation condition each cell was exposed to during data acquisition. If provided, the feature extraction will be further broken down into subgroups by stimulation condition (and features from each stimulation condition will be included as their own features in wide format).

lineage_cols

Unquoted column names representing which columns in `tof_tibble` (i.e. which CyTOF protein measurements) should be considered lineage markers in the feature extraction calculation. Supports tidyselect helpers.

signaling_cols

Unquoted column names representing which columns in `tof_tibble` (i.e. which CyTOF protein measurements) should be considered signaling markers in the feature extraction calculation. Supports tidyselect helpers.

central_tendency_function

The function that will be used to calculate the measurement of central tendency for each cluster. Defaults to `stats::median`.

signaling_method

A string indicating which feature extraction method to use for signaling markers (as identified by the `signaling_cols` argument). Options are "threshold" (the default), "emd", "jsd", and "central tendency".

basal_level

A string indicating which value in `stimulation_col` corresponds to the basal stimulation condition (e.g. "basal" or "unstimulated").

...

Optional additional arguments to be passed to `tof_extract_threshold`, `tof_extract_emd`, or `tof_extract_jsd`.

Value

A tibble.

The output tibble will have 1 row for each combination of the grouping variables provided in `group_cols` (thus, each row will represent what is considered a single "sample" based on the grouping provided). It will have one column for each grouping variable and one column for each extracted feature ("wide" format).

Details

Lineage channels are specified using the `lineage_cols` argument, and their extracted features will be measurements of central tendency (as computed by the user-supplied `central_tendency_function`).
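As a rough illustration of what a per-cluster central tendency feature is, the sketch below computes one by hand in base R. It is not the package's implementation; the data frame `cells` and its values are made up for this example.

```r
# Toy data: two clusters of four cells each (values are invented for illustration).
cells <- data.frame(
  cluster_id = rep(c("a", "b"), each = 4),
  cd45 = c(1, 2, 3, 10, 0, 0.5, 1, 1.5)
)

# One central tendency feature per cluster, using the median
# (the documented default for central_tendency_function).
ct_median <- tapply(cells$cd45, cells$cluster_id, median)

# Swapping in another summary function changes the feature, e.g. the mean.
ct_mean <- tapply(cells$cd45, cells$cluster_id, mean)

print(ct_median)
print(ct_mean)
```

In cluster "a" the outlying value of 10 pulls the mean well above the median, which is one reason a robust summary may be preferable for skewed marker distributions.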

Signaling channels are specified using the `signaling_cols` argument, and their extracted features will depend on the user's chosen `signaling_method`. If `signaling_method` == "threshold" (the default), `tof_extract_threshold` will be used to calculate the proportion of cells in each cluster with signaling marker expression over `threshold` in each stimulation condition. If `signaling_method` == "emd" or `signaling_method` == "jsd", `tof_extract_emd` or `tof_extract_jsd` will be used to calculate the earth mover's distance (EMD) or Jensen-Shannon distance (JSD), respectively, between the basal condition and each of the stimulated conditions in each cluster for each sample. Finally, if `signaling_method` == "central tendency", `tof_extract_central_tendency` will be used to calculate measurements of central tendency.
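To make the "threshold" and "emd" features more concrete, here is a minimal base R sketch of the underlying quantities. It is an illustration under stated assumptions, not the package's code: the toy data, the helper `emd_1d`, and the choice to compute the distance directly on raw values (rather than on binned histograms, which the Examples section suggests the package uses) are all assumptions made here, so the numbers need not match what `tof_extract_emd` returns.

```r
# Toy data (invented): one signaling marker, two clusters, two conditions.
cells <- data.frame(
  cluster_id = c("a", "a", "a", "a", "b", "b"),
  stim       = c("basal", "basal", "stim", "stim", "basal", "stim"),
  cd34       = c(0.2, 2.0, 1.8, 3.1, 0.1, 0.4)
)

# "threshold"-style feature: the proportion of cells in each
# cluster/condition whose expression exceeds the threshold.
# asinh(10 / 5) is the default threshold quoted in the Examples section.
threshold <- asinh(10 / 5)
prop_over <- tapply(
  cells$cd34 > threshold,
  list(cells$cluster_id, cells$stim),
  mean
)
print(prop_over)

# "emd"-style feature: for two 1-D samples of equal size, the earth mover's
# distance equals the mean absolute difference between the sorted values.
# emd_1d is a hypothetical helper written for this sketch.
emd_1d <- function(x, y) {
  stopifnot(length(x) == length(y))
  mean(abs(sort(x) - sort(y)))
}

basal_values <- c(0, 1, 2)
stim_values  <- c(1, 2, 3)
emd_value <- emd_1d(basal_values, stim_values)
print(emd_value)
```

Here every stimulated value sits exactly one unit above its basal counterpart, so the distance is 1: the whole distribution has to be shifted by one unit.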

In addition, `tof_extract_proportion` will be used to compute the proportion of cells in each cluster for each sample.
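The cluster proportion feature can be pictured as a row-normalized contingency table. The base R sketch below shows the idea on made-up data; it is not how the package computes it, and `cells` is a hypothetical data frame.

```r
# Toy data (invented): which cluster each cell from each patient fell into.
cells <- data.frame(
  patient    = c("kirby", "kirby", "kirby", "kirby", "mario", "mario"),
  cluster_id = c("a", "a", "a", "b", "a", "b")
)

# Count cells per patient (rows) and cluster (columns) ...
counts <- table(cells$patient, cells$cluster_id)

# ... then divide each row by its total, so every sample's proportions sum to 1.
props <- prop.table(counts, margin = 1)
print(props)
```

Each row corresponds to one sample and each column to one cluster, which mirrors the wide layout of the `prop@...` columns in the example output.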

These calculations can be performed either overall (across all cells in the dataset) or after breaking down the cells into subgroups using `group_cols`.

Examples

sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000),
        cluster_id = sample(letters, size = 1000, replace = TRUE),
        patient = sample(c("kirby", "mario"), size = 1000, replace = TRUE),
        stim = sample(c("basal", "stim"), size = 1000, replace = TRUE)
    )

# extract the following features from each cluster in each
# patient/stimulation:
#    - proportion of each cluster
#    - central tendency (median) of cd45 and cd38 in each cluster
#    - the proportion of cells in each cluster with cd34 expression over
#      the default threshold (asinh(10 / 5))
tof_extract_features(
    tof_tibble = sim_data,
    cluster_col = cluster_id,
    group_cols = patient,
    lineage_cols = c(cd45, cd38),
    signaling_cols = cd34,
    stimulation_col = stim
)
#> # A tibble: 2 × 131
#>   patient `prop@a` `prop@b` `prop@c` `prop@d` `prop@e` `prop@f` `prop@g`
#>   <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 kirby     0.0356   0.0413   0.0507   0.0281   0.0338   0.0281   0.0432
#> 2 mario     0.0236   0.0343   0.0407   0.0450   0.0535   0.0450   0.0493
#> # ℹ 123 more variables: `prop@h` <dbl>, `prop@i` <dbl>, `prop@j` <dbl>,
#> #   `prop@k` <dbl>, `prop@l` <dbl>, `prop@m` <dbl>, `prop@n` <dbl>,
#> #   `prop@o` <dbl>, `prop@p` <dbl>, `prop@q` <dbl>, `prop@r` <dbl>,
#> #   `prop@s` <dbl>, `prop@t` <dbl>, `prop@u` <dbl>, `prop@v` <dbl>,
#> #   `prop@w` <dbl>, `prop@x` <dbl>, `prop@y` <dbl>, `prop@z` <dbl>,
#> #   `cd45@a_ct` <dbl>, `cd38@a_ct` <dbl>, `cd45@b_ct` <dbl>, `cd38@b_ct` <dbl>,
#> #   `cd45@c_ct` <dbl>, `cd38@c_ct` <dbl>, `cd45@d_ct` <dbl>, …

# extract the following features from each cluster in each
# patient/stimulation:
#    - proportion of each cluster
#    - central tendency (mean) of cd45 and cd38 in each cluster
#    - the earth mover's distance between each cluster's cd34 histogram in
#      the "basal" and "stim" conditions
tof_extract_features(
    tof_tibble = sim_data,
    cluster_col = cluster_id,
    group_cols = patient,
    lineage_cols = c(cd45, cd38),
    signaling_cols = cd34,
    central_tendency_function = mean,
    stimulation_col = stim,
    signaling_method = "emd",
    basal_level = "basal"
)
#> # A tibble: 2 × 131
#>   patient `prop@a` `prop@b` `prop@c` `prop@d` `prop@e` `prop@f` `prop@g`
#>   <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 kirby     0.0356   0.0413   0.0507   0.0281   0.0338   0.0281   0.0432
#> 2 mario     0.0236   0.0343   0.0407   0.0450   0.0535   0.0450   0.0493
#> # ℹ 123 more variables: `prop@h` <dbl>, `prop@i` <dbl>, `prop@j` <dbl>,
#> #   `prop@k` <dbl>, `prop@l` <dbl>, `prop@m` <dbl>, `prop@n` <dbl>,
#> #   `prop@o` <dbl>, `prop@p` <dbl>, `prop@q` <dbl>, `prop@r` <dbl>,
#> #   `prop@s` <dbl>, `prop@t` <dbl>, `prop@u` <dbl>, `prop@v` <dbl>,
#> #   `prop@w` <dbl>, `prop@x` <dbl>, `prop@y` <dbl>, `prop@z` <dbl>,
#> #   `cd45@a_ct` <dbl>, `cd38@a_ct` <dbl>, `cd45@b_ct` <dbl>, `cd38@b_ct` <dbl>,
#> #   `cd45@c_ct` <dbl>, `cd38@c_ct` <dbl>, `cd45@d_ct` <dbl>, …