Upsample cells into the closest cluster in a reference dataset

This function performs distance-based upsampling on CyTOF data by assigning each single cell (passed into the function as `tof_tibble`) to its most phenotypically similar cell subpopulation in a reference dataset (passed into the function as `reference_tibble`). It does so by calculating the distance (Mahalanobis, cosine, or Pearson) between each cell in `tof_tibble` and the centroid of each cluster in `reference_tibble`, then assigning each cell to the cluster with the closest centroid.
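
As a rough illustration of the nearest-centroid assignment described above, the following base R sketch computes the Mahalanobis distance from a few cells to each reference cluster and picks the closest one. It is not the package's implementation: the toy data, the `markers` vector, and the choice of one covariance matrix per cluster are assumptions made for this example.

```r
set.seed(1)

# Hypothetical toy data: two reference clusters with well-separated means
markers <- c("cd45", "cd38")
reference <- data.frame(
    cd45 = c(rnorm(100, mean = 0), rnorm(100, mean = 5)),
    cd38 = c(rnorm(100, mean = 0), rnorm(100, mean = 5)),
    cluster_id = rep(c("a", "b"), each = 100)
)
cells <- data.frame(
    cd45 = c(0.1, 4.8, -0.3),
    cd38 = c(-0.2, 5.1, 0.4)
)

# Split the reference markers by cluster label
# (one centroid and one covariance matrix per cluster is an assumption)
clusters <- split(reference[markers], reference$cluster_id)

# Squared Mahalanobis distance from every cell to every cluster centroid;
# the result is a matrix with one row per cell and one column per cluster
distances <- sapply(
    clusters,
    function(cl) {
        stats::mahalanobis(
            as.matrix(cells[markers]),
            center = colMeans(cl),
            cov = stats::cov(cl)
        )
    }
)

# Assign each cell to the cluster with the smallest distance
assigned <- colnames(distances)[apply(distances, 1, which.min)]
assigned
#> [1] "a" "b" "a"
```

Note that `stats::mahalanobis()` returns squared distances; because squaring preserves ordering, the closest cluster is the same either way.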

Usage

tof_upsample_distance(
  tof_tibble,
  reference_tibble,
  reference_cluster_col,
  upsample_cols = where(tof_is_numeric),
  parallel_cols,
  distance_function = c("mahalanobis", "cosine", "pearson"),
  num_cores = 1L,
  return_distances = FALSE
)

Arguments

tof_tibble: A `tibble` or `tof_tbl` containing cells to be upsampled into their nearest reference subpopulation.
reference_tibble: A `tibble` or `tof_tbl` containing cells that have already been clustered or manually gated into subpopulations.
reference_cluster_col: An unquoted column name indicating which column in `reference_tibble` contains the subpopulation label (or cluster id) for each cell in `reference_tibble`.
upsample_cols: Unquoted column names indicating which columns in `tof_tibble` to use in computing the distances used for upsampling. Defaults to all numeric columns in `tof_tibble`. Supports tidyselect helpers.
parallel_cols: Optional. Unquoted column names indicating which columns in `tof_tibble` to use for breaking up the data in order to parallelize the upsampling using `foreach` on a `doParallel` backend. Supports tidyselect helpers.
distance_function: A string indicating which distance function should be used to perform the upsampling. Options are "mahalanobis" (the default), "cosine", and "pearson".
num_cores: An integer indicating the number of CPU cores used to parallelize the classification. Defaults to 1 (a single core).
return_distances: A boolean value indicating whether the returned result should include only one column containing the cluster id assigned to each row of `tof_tibble` (return_distances = FALSE, the default), or should also include additional columns containing the distance between each row of `tof_tibble` and each of the reference subpopulation centroids (return_distances = TRUE).
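
To make the "cosine" and "pearson" options for `distance_function` more concrete, here is a minimal base R sketch of both distances between one cell and one centroid. It assumes the common convention of "1 minus similarity"; whether the package uses exactly these formulas is an assumption, and the helper names `cosine_distance()` and `pearson_distance()` are hypothetical.

```r
# Cosine distance: 1 minus the cosine of the angle between the two vectors
# (hypothetical helper, not part of the package)
cosine_distance <- function(x, y) {
    1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Pearson distance: 1 minus the Pearson correlation across markers
# (hypothetical helper, not part of the package)
pearson_distance <- function(x, y) {
    1 - stats::cor(x, y, method = "pearson")
}

# A cell and a centroid that point in the same direction but differ in scale
cell <- c(cd45 = 1, cd38 = 2, cd34 = 3, cd19 = 4)
centroid <- c(cd45 = 2, cd38 = 4, cd34 = 6, cd19 = 8)

cosine_distance(cell, centroid)       # approximately 0: same direction
pearson_distance(cell, centroid)      # approximately 0: perfectly correlated
pearson_distance(cell, rev(centroid)) # approximately 2: perfectly anti-correlated
```

Under these definitions both distances ignore the overall magnitude of a cell's marker expression and compare only its profile shape, unlike the Mahalanobis distance, which is sensitive to scale and to the covariance of the reference data.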

Value

If `return_distances = FALSE`, a tibble with one column named `.upsample_cluster`, a character vector of length `nrow(tof_tibble)` indicating the id of the reference cluster to which each cell (i.e. each row) in `tof_tibble` was assigned.

If `return_distances = TRUE`, a tibble with `nrow(tof_tibble)` rows and num_clusters + 1 columns, where num_clusters is the number of clusters in `reference_tibble`. Each row represents a cell from `tof_tibble`, and num_clusters of the columns represent the distance between the cell and each of the reference subpopulations' cluster centroids. The final column represents the cluster id of the reference subpopulation with the minimum distance to the cell represented by that row.
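
The shape described above can be checked with a short call. This is a sketch assuming the `tidytof` and `dplyr` packages are installed; the objects `small_cells` and `small_reference` are hypothetical simulated data, and only the dimensions stated in this section are inspected, since the names of the distance columns are not documented here.

```r
library(tidytof)

set.seed(2020)

# Hypothetical simulated cells to upsample
small_cells <-
    dplyr::tibble(
        cd45 = rnorm(n = 50),
        cd38 = rnorm(n = 50)
    )

# Hypothetical reference data with two clusters, "a" and "b"
small_reference <-
    dplyr::tibble(
        cd45 = rnorm(n = 200),
        cd38 = rnorm(n = 200),
        cluster_id = rep(c("a", "b"), each = 100)
    )

# request the per-cluster distances in addition to the assigned cluster
distance_result <-
    tof_upsample_distance(
        tof_tibble = small_cells,
        reference_tibble = small_reference,
        reference_cluster_col = cluster_id,
        return_distances = TRUE
    )

# per the description above: nrow(small_cells) rows and
# num_clusters + 1 columns (here, 2 + 1)
dim(distance_result)
```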

Examples

# simulate single-cell data (and reference data with clusters to upsample
# into)
sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000)
    )

reference_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 200),
        cd38 = rnorm(n = 200),
        cd34 = rnorm(n = 200),
        cd19 = rnorm(n = 200),
        cluster_id = c(rep("a", times = 100), rep("b", times = 100))
    )

# upsample using mahalanobis distance
tof_upsample_distance(
    tof_tibble = sim_data,
    reference_tibble = reference_data,
    reference_cluster_col = cluster_id
)
#> # A tibble: 1,000 × 1
#>    .upsample_cluster
#>    <chr>            
#>  1 a                
#>  2 b                
#>  3 a                
#>  4 a                
#>  5 b                
#>  6 a                
#>  7 a                
#>  8 a                
#>  9 b                
#> 10 b                
#> # ℹ 990 more rows

# upsample using cosine distance
tof_upsample_distance(
    tof_tibble = sim_data,
    reference_tibble = reference_data,
    reference_cluster_col = cluster_id,
    distance_function = "cosine"
)
#> # A tibble: 1,000 × 1
#>    .upsample_cluster
#>    <chr>            
#>  1 a                
#>  2 b                
#>  3 a                
#>  4 a                
#>  5 b                
#>  6 a                
#>  7 a                
#>  8 a                
#>  9 b                
#> 10 b                
#> # ℹ 990 more rows