Upsample cells into the closest cluster in a reference dataset

This function performs distance-based upsampling on CyTOF data by sorting single cells (passed into the function as `tof_tibble`) into their most phenotypically similar cell subpopulation in a reference dataset (passed into the function as `reference_tibble`). With the default method (`"distance"`), it calculates the distance (Mahalanobis, cosine, or Pearson) between each cell in `tof_tibble` and the centroid of each cluster in `reference_tibble`, then sorts each cell into the cluster with the closest centroid. With `method = "neighbor"`, cells are instead upsampled based on their nearest neighbors in `reference_tibble`.

Usage

tof_upsample(
  tof_tibble,
  reference_tibble,
  reference_cluster_col,
  upsample_cols = where(tof_is_numeric),
  ...,
  augment = TRUE,
  method = c("distance", "neighbor")
)

Arguments

tof_tibble: A `tibble` or `tof_tbl` containing cells to be upsampled into their nearest reference subpopulation.
reference_tibble: A `tibble` or `tof_tbl` containing cells that have already been clustered or manually gated into subpopulations.
reference_cluster_col: An unquoted column name indicating which column in `reference_tibble` contains the subpopulation label (or cluster id) for each cell in `reference_tibble`.
upsample_cols: Unquoted column names indicating which columns in `tof_tibble` to use in computing the distances used for upsampling. Defaults to all numeric columns in `tof_tibble`. Supports tidyselect helpers.
...: Additional arguments to pass to the `tof_upsample_*` function family member corresponding to the chosen method.
augment: A boolean value indicating if the output should column-bind the cluster ids of each cell as a new column in `tof_tibble` (TRUE, the default) or if a single-column tibble including only the cluster ids should be returned (FALSE).
method: A string indicating which upsampling method should be used. Valid values are "distance" (the default) and "neighbor".

Value

A `tof_tbl` or `tibble`. If `augment = FALSE`, it will have a single column encoding the upsampled cluster ids for each cell in `tof_tibble`. If `augment = TRUE`, it will have `ncol(tof_tibble) + 1` columns: each of the (unaltered) columns in `tof_tibble` plus an additional column encoding the cluster ids.

Examples

# simulate single-cell data (and reference data with clusters to upsample
# into)
sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000)
    )
reference_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 200),
        cd38 = rnorm(n = 200),
        cd34 = rnorm(n = 200),
        cd19 = rnorm(n = 200),
        cluster_id = c(rep("a", times = 100), rep("b", times = 100))
    )

# upsample using distance to cluster centroids
tof_upsample(
    tof_tibble = sim_data,
    reference_tibble = reference_data,
    reference_cluster_col = cluster_id,
    method = "distance"
)
#> # A tibble: 1,000 × 5
#>       cd45    cd38   cd34   cd19 .upsample_cluster
#>      <dbl>   <dbl>  <dbl>  <dbl> <chr>            
#>  1  1.19    1.33    0.908  1.77  a                
#>  2 -1.79    0.0663 -0.388 -1.38  b                
#>  3 -0.273   1.71   -0.706  0.366 a                
#>  4 -1.02   -0.0678 -0.265  1.37  a                
#>  5 -1.51    0.0333 -0.198 -0.204 b                
#>  6 -0.273   0.217   2.07   1.03  b                
#>  7  0.822   0.953   0.167 -1.07  a                
#>  8  0.783  -0.465   0.657  0.575 a                
#>  9  0.0513 -0.380  -0.509  1.08  a                
#> 10  0.856   0.153   0.388 -1.76  a                
#> # ℹ 990 more rows

# upsample using distance to nearest neighbor
tof_upsample(
    tof_tibble = sim_data,
    reference_tibble = reference_data,
    reference_cluster_col = cluster_id,
    method = "neighbor"
)
#> # A tibble: 1,000 × 5
#>       cd45    cd38   cd34   cd19 .upsample_cluster
#>      <dbl>   <dbl>  <dbl>  <dbl> <chr>            
#>  1  1.19    1.33    0.908  1.77  a                
#>  2 -1.79    0.0663 -0.388 -1.38  b                
#>  3 -0.273   1.71   -0.706  0.366 b                
#>  4 -1.02   -0.0678 -0.265  1.37  b                
#>  5 -1.51    0.0333 -0.198 -0.204 b                
#>  6 -0.273   0.217   2.07   1.03  b                
#>  7  0.822   0.953   0.167 -1.07  b                
#>  8  0.783  -0.465   0.657  0.575 a                
#>  9  0.0513 -0.380  -0.509  1.08  b                
#> 10  0.856   0.153   0.388 -1.76  a                
#> # ℹ 990 more rows
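
The centroid-based assignment described at the top of this page can be sketched in base R. This is only an illustration of the idea, not the package's implementation: `nearest_centroid_cosine()` is a hypothetical helper that does not exist in the package, it handles only cosine distance, and the toy data below are made up for the example.

```r
# Hypothetical helper (NOT part of the package): assign each cell to the
# reference cluster whose centroid is closest in cosine distance.
nearest_centroid_cosine <- function(cells, reference, cluster_ids) {
    cells <- as.matrix(cells)
    reference <- as.matrix(reference)

    # one centroid per cluster: the column-wise mean of that cluster's cells
    sums <- rowsum(reference, group = cluster_ids)
    counts <- table(cluster_ids)[rownames(sums)]
    centroids <- sums / as.vector(counts)

    # scale every row to unit length so a dot product is a cosine similarity
    unit_rows <- function(m) m / sqrt(rowSums(m^2))
    distance <- 1 - unit_rows(cells) %*% t(unit_rows(centroids))

    # for each cell (row), pick the cluster (column) with the smallest distance
    rownames(centroids)[apply(distance, 1, which.min)]
}

# toy reference data: cluster "a" has positive cd45, cluster "b" negative cd45
reference <- data.frame(
    cd45 = c(1, 2, -1, -2),
    cd38 = c(0.1, -0.1, 0.1, -0.1)
)
cluster_ids <- c("a", "a", "b", "b")

# two cells to upsample
cells <- data.frame(cd45 = c(3, -0.5), cd38 = c(1, 0.2))

assigned <- nearest_centroid_cosine(cells, reference, cluster_ids)
print(assigned)
```

Here the first cell points in the same direction as the centroid of cluster "a" and the second in the direction of cluster "b", so they are assigned to "a" and "b" respectively. Swapping in another distance would only change how `distance` is computed; the final "closest centroid" step stays the same.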