Extract aggregated features from CyTOF data using the Jensen-Shannon Distance (JSD)
Source: R/feature_extraction.R
This feature extraction function calculates the Jensen-Shannon Distance (JSD) between the stimulated and unstimulated ("basal") experimental conditions of samples in a CyTOF experiment. This calculation is performed across a user-specified selection of CyTOF antigens and can be performed either overall (across all cells in the dataset) or after breaking down the cells into subgroups using `group_cols`.
Arguments
- tof_tibble
- A `tof_tbl` or a `tibble`. 
- cluster_col
- An unquoted column name indicating which column in `tof_tibble` stores the cluster ids of the cluster to which each cell belongs. Cluster labels can be produced via any method the user chooses - including manual gating, any of the functions in the `tof_cluster_*` function family, or any other method. 
- group_cols
- Unquoted column names representing which columns in `tof_tibble` should be used to break the rows of `tof_tibble` into subgroups for the feature extraction calculation. Defaults to NULL (i.e. performing the extraction without subgroups). 
- marker_cols
- Unquoted column names representing which columns in `tof_tibble` (i.e. which CyTOF protein measurements) should be included in the feature extraction calculation. Defaults to all numeric (integer or double) columns. Supports tidyselect helpers. 
- jsd_col
- An unquoted column name that indicates which column in `tof_tibble` contains information about which stimulation condition each cell was exposed to during data acquisition. If provided, the feature extraction will be further broken down into subgroups by stimulation condition (and features from each stimulation condition will be included as their own features in wide format). 
- reference_level
- A string indicating which value in `jsd_col` corresponds to the basal stimulation condition (e.g. "basal" or "unstimulated"). 
- format
- A string indicating if the data should be returned in "wide" format (the default; each cluster feature is given its own column) or in "long" format (each cluster feature is provided as its own row). 
- num_bins
- Optional. The number of bins to use in dividing one-dimensional marker distributions into discrete segments for the JSD calculation. Defaults to 100. 
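To make the role of `num_bins` concrete, here is a minimal base-R sketch of a binned Jensen-Shannon distance between two one-dimensional samples. The helper name `jsd_sketch` is hypothetical, and the choices it makes (shared equal-width bins, log base 2, square root of the divergence) are assumptions for illustration; the package's internal calculation may differ.

```r
# Hypothetical helper, not part of tidytof: Jensen-Shannon distance between
# two numeric vectors after binning both onto a shared grid.
jsd_sketch <- function(x, y, num_bins = 100) {
    # Equal-width bin edges spanning the range of both samples
    breaks <- seq(min(x, y), max(x, y), length.out = num_bins + 1)
    # Bin each sample and convert counts to probabilities
    bin_probs <- function(v) {
        idx <- findInterval(v, breaks, rightmost.closed = TRUE)
        tabulate(idx, nbins = num_bins) / length(v)
    }
    p <- bin_probs(x)
    q <- bin_probs(y)
    m <- (p + q) / 2
    # Kullback-Leibler divergence in bits, treating 0 * log(0) as 0
    kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
    # Distance is the square root of the divergence (assumed convention);
    # max() guards against tiny negative values from rounding
    sqrt(max(0, (kl(p, m) + kl(q, m)) / 2))
}

# Simulated basal and stimulated measurements of one marker
set.seed(1)
basal <- rnorm(500)
stimulated <- rnorm(500, mean = 1)
jsd_sketch(basal, stimulated, num_bins = 100)
```

With base-2 logarithms the result lies between 0 (identical binned distributions) and 1 (no overlapping bins). Under this sketch, using more bins on small cell populations tends to leave many bins empty and pushes the distance upward.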
Value
A tibble.
If format == "wide", the tibble will have 1 row for each combination of the grouping variables provided in `group_cols`. It will have one column for each grouping variable and one column for each extracted feature (the JSD between the distribution of a given marker in a given cluster in the basal condition and the distribution of that marker in the same cluster in a stimulated condition). The name of each column containing a cluster feature is obtained using the following pattern: "{stimulation_id}_{marker_id}@{cluster_id}_jsd".
If format == "long", the tibble will have 1 row for each combination of the grouping variables in `group_cols`, each cluster id (i.e. level) in `cluster_col`, each marker in `marker_cols`, and each stimulation condition. It will have one column for each grouping variable, one column for the cluster ids, one column for the CyTOF channel names, one column for the stimulation condition (`stimulation` in the example output below), and one column containing the features (`jsd` in the example output below).
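The wide-format naming pattern can be split back into its parts when downstream code needs the stimulation, marker, and cluster separately. The following base-R sketch is illustrative only: the regular expression is an assumption and requires that marker names contain no underscore or "@" character.

```r
# Column names following "{stimulation_id}_{marker_id}@{cluster_id}_jsd"
feature_names <- c("stim_cd45@m_jsd", "stim_cd38@x_jsd")

# Assumed pattern: marker ids contain neither "_" nor "@"
pattern <- "^(.+)_([^_@]+)@(.+)_jsd$"
parts <- regmatches(feature_names, regexec(pattern, feature_names))

# Each match holds the full string followed by the three captured groups
parsed <- do.call(
    rbind,
    lapply(parts, function(p) {
        data.frame(
            stimulation = p[2],
            marker = p[3],
            cluster = p[4],
            stringsAsFactors = FALSE
        )
    })
)
parsed
```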
See also
Other feature extraction functions:
tof_extract_central_tendency(),
tof_extract_emd(),
tof_extract_features(),
tof_extract_proportion(),
tof_extract_threshold()
Examples
sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000),
        cluster_id = sample(letters, size = 1000, replace = TRUE),
        patient = sample(c("kirby", "mario"), size = 1000, replace = TRUE),
        stim = sample(c("basal", "stim"), size = 1000, replace = TRUE)
    )
# extract jsd of each cluster in each patient (using the "basal" stim
# condition as a reference) in wide format
tof_extract_jsd(
    tof_tibble = sim_data,
    cluster_col = cluster_id,
    group_cols = patient,
    jsd_col = stim,
    reference_level = "basal"
)
#> # A tibble: 2 × 105
#>   patient `stim_cd45@m_jsd` `stim_cd38@m_jsd` `stim_cd34@m_jsd`
#>   <chr>               <dbl>             <dbl>             <dbl>
#> 1 mario               0.909             0.909             0.909
#> 2 kirby              NA                NA                NA    
#> # ℹ 101 more variables: `stim_cd19@m_jsd` <dbl>, `stim_cd45@x_jsd` <dbl>,
#> #   `stim_cd38@x_jsd` <dbl>, `stim_cd34@x_jsd` <dbl>, `stim_cd19@x_jsd` <dbl>,
#> #   `stim_cd45@y_jsd` <dbl>, `stim_cd38@y_jsd` <dbl>, `stim_cd34@y_jsd` <dbl>,
#> #   `stim_cd19@y_jsd` <dbl>, `stim_cd45@t_jsd` <dbl>, `stim_cd38@t_jsd` <dbl>,
#> #   `stim_cd34@t_jsd` <dbl>, `stim_cd19@t_jsd` <dbl>, `stim_cd45@f_jsd` <dbl>,
#> #   `stim_cd38@f_jsd` <dbl>, `stim_cd34@f_jsd` <dbl>, `stim_cd19@f_jsd` <dbl>,
#> #   `stim_cd45@d_jsd` <dbl>, `stim_cd38@d_jsd` <dbl>, …
# extract jsd of each cluster (using the "basal" stim
# condition as a reference) in long format
tof_extract_jsd(
    tof_tibble = sim_data,
    cluster_col = cluster_id,
    jsd_col = stim,
    reference_level = "basal",
    format = "long"
)
#> # A tibble: 104 × 4
#>    cluster_id marker stimulation   jsd
#>    <chr>      <chr>  <chr>       <dbl>
#>  1 m          cd45   stim        0.818
#>  2 m          cd38   stim        0.680
#>  3 m          cd34   stim        0.818
#>  4 m          cd19   stim        0.620
#>  5 x          cd45   stim        0.772
#>  6 x          cd38   stim        0.675
#>  7 x          cd34   stim        0.662
#>  8 x          cd19   stim        0.806
#>  9 y          cd45   stim        0.753
#> 10 y          cd38   stim        0.679
#> # ℹ 94 more rows
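# The long and wide outputs carry the same features under different layouts.
# As a rough base-R illustration (not the package's own reshaping code), the
# wide-format feature names can be rebuilt from long-format rows. The small
# data frame below is hand-written, reusing three rows of the output above.

```r
# Hand-written stand-in for a few rows of the long-format result
long_jsd <- data.frame(
    cluster_id = c("m", "m", "x"),
    marker = c("cd45", "cd38", "cd45"),
    stimulation = "stim",
    jsd = c(0.818, 0.680, 0.772),
    stringsAsFactors = FALSE
)

# Build names with the pattern "{stimulation_id}_{marker_id}@{cluster_id}_jsd"
feature_name <- with(
    long_jsd,
    paste0(stimulation, "_", marker, "@", cluster_id, "_jsd")
)

# Named vector: one element per wide-format feature column
wide_features <- setNames(long_jsd$jsd, feature_name)
wide_features
```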