This function upsamples CyTOF data by assigning single cells (passed into the function as `tof_tibble`) to their most phenotypically similar cell subpopulation in a reference dataset (passed into the function as `reference_tibble`). With `method = "distance"`, it calculates the distance (Mahalanobis, cosine, or Pearson) between each cell in `tof_tibble` and the centroid of each cluster in `reference_tibble`, then assigns each cell to the cluster with the closest centroid. With `method = "neighbor"`, cells are instead assigned using their nearest neighbor in `reference_tibble`.
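The centroid-based assignment described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the data frames `reference` and `query`, the `markers` vector, and the choice of Mahalanobis distance via `stats::mahalanobis()` are all assumptions made for the example.

```r
set.seed(2023)

# Hypothetical reference data: two well-separated clusters
reference <- data.frame(
  cd45 = c(rnorm(100, mean = 0), rnorm(100, mean = 5)),
  cd38 = c(rnorm(100, mean = 0), rnorm(100, mean = 5)),
  cluster_id = rep(c("a", "b"), each = 100)
)

# Hypothetical query cells to assign to a reference cluster
query <- data.frame(
  cd45 = c(0.1, 4.9, -0.5),
  cd38 = c(-0.2, 5.2, 0.4)
)

markers <- c("cd45", "cd38")

# For each reference cluster, compute the squared Mahalanobis distance
# from every query cell to that cluster's centroid.
# The result is a matrix with one row per query cell and one column per cluster.
distances <- sapply(
  split(reference[, markers], reference$cluster_id),
  function(cluster_cells) {
    stats::mahalanobis(
      x = as.matrix(query[, markers]),
      center = colMeans(cluster_cells),
      cov = stats::cov(cluster_cells)
    )
  }
)

# Assign each query cell to the cluster with the smallest distance
query$.upsample_cluster <- colnames(distances)[apply(distances, 1, which.min)]
```

Because each cluster gets its own covariance matrix here, a wide cluster can claim a cell that is closer in raw units to a tighter cluster's centroid. Whether that matches the package's behavior is not confirmed by this page.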
Arguments
- tof_tibble
A `tibble` or `tof_tbl` containing cells to be upsampled into their nearest reference subpopulation.
- reference_tibble
A `tibble` or `tof_tbl` containing cells that have already been clustered or manually gated into subpopulations.
- reference_cluster_col
An unquoted column name indicating which column in `reference_tibble` contains the subpopulation label (or cluster id) for each cell in `reference_tibble`.
- upsample_cols
Unquoted column names indicating which columns in `tof_tibble` to use in computing the distances used for upsampling. Defaults to all numeric columns in `tof_tibble`. Supports tidyselect helpers.
- ...
Additional arguments to pass to the `tof_upsample_*` function family member corresponding to the chosen method.
- augment
A boolean value indicating if the output should column-bind the cluster ids of each cell as a new column in `tof_tibble` (TRUE, the default) or if a single-column tibble including only the cluster ids should be returned (FALSE).
- method
A string indicating which upsampling method should be used. Valid values include "distance" (default) and "neighbor".
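To illustrate how a nearest-neighbor assignment differs from a centroid-based one, here is a minimal base R sketch. It is not the package's implementation: the `reference` and `query` data frames, the `markers` vector, and the use of Euclidean distance are assumptions made for the example.

```r
set.seed(2023)

# Hypothetical reference data: two well-separated clusters
reference <- data.frame(
  cd45 = c(rnorm(100, mean = 0), rnorm(100, mean = 5)),
  cd38 = c(rnorm(100, mean = 0), rnorm(100, mean = 5)),
  cluster_id = rep(c("a", "b"), each = 100)
)

# Hypothetical query cells to assign to a reference cluster
query <- data.frame(
  cd45 = c(0.1, 4.9, -0.5),
  cd38 = c(-0.2, 5.2, 0.4)
)

markers <- c("cd45", "cd38")
reference_matrix <- as.matrix(reference[, markers])
query_matrix <- as.matrix(query[, markers])

# For each query cell, find the row index of the closest reference cell
# (squared Euclidean distance; an assumption for this sketch)
nearest_index <- apply(query_matrix, 1, function(cell) {
  squared_distances <- rowSums(sweep(reference_matrix, 2, cell)^2)
  which.min(squared_distances)
})

# Each query cell inherits the cluster label of its nearest reference cell
query$.upsample_cluster <- reference$cluster_id[nearest_index]
```

A nearest-neighbor rule depends on individual reference cells rather than cluster summaries, so it can follow irregular cluster shapes but is more sensitive to outlying reference cells.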
Value
A `tof_tbl` or `tibble`. If `augment = FALSE`, it will have a single column encoding the upsampled cluster ids for each cell in `tof_tibble`. If `augment = TRUE`, it will have `ncol(tof_tibble) + 1` columns: each of the (unaltered) columns in `tof_tibble` plus an additional column encoding the cluster ids.
Examples
# simulate single-cell data (and reference data with clusters to upsample into)
sim_data <-
  dplyr::tibble(
    cd45 = rnorm(n = 1000),
    cd38 = rnorm(n = 1000),
    cd34 = rnorm(n = 1000),
    cd19 = rnorm(n = 1000)
  )
reference_data <-
  dplyr::tibble(
    cd45 = rnorm(n = 200),
    cd38 = rnorm(n = 200),
    cd34 = rnorm(n = 200),
    cd19 = rnorm(n = 200),
    cluster_id = c(rep("a", times = 100), rep("b", times = 100))
  )
# upsample using distance to cluster centroids
tof_upsample(
  tof_tibble = sim_data,
  reference_tibble = reference_data,
  reference_cluster_col = cluster_id,
  method = "distance"
)
#> # A tibble: 1,000 × 5
#> cd45 cd38 cd34 cd19 .upsample_cluster
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1.19 1.33 0.908 1.77 a
#> 2 -1.79 0.0663 -0.388 -1.38 b
#> 3 -0.273 1.71 -0.706 0.366 a
#> 4 -1.02 -0.0678 -0.265 1.37 a
#> 5 -1.51 0.0333 -0.198 -0.204 b
#> 6 -0.273 0.217 2.07 1.03 b
#> 7 0.822 0.953 0.167 -1.07 a
#> 8 0.783 -0.465 0.657 0.575 a
#> 9 0.0513 -0.380 -0.509 1.08 a
#> 10 0.856 0.153 0.388 -1.76 a
#> # ℹ 990 more rows
# upsample using distance to nearest neighbor
tof_upsample(
  tof_tibble = sim_data,
  reference_tibble = reference_data,
  reference_cluster_col = cluster_id,
  method = "neighbor"
)
#> # A tibble: 1,000 × 5
#> cd45 cd38 cd34 cd19 .upsample_cluster
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1.19 1.33 0.908 1.77 a
#> 2 -1.79 0.0663 -0.388 -1.38 b
#> 3 -0.273 1.71 -0.706 0.366 b
#> 4 -1.02 -0.0678 -0.265 1.37 b
#> 5 -1.51 0.0333 -0.198 -0.204 b
#> 6 -0.273 0.217 2.07 1.03 b
#> 7 0.822 0.953 0.167 -1.07 b
#> 8 0.783 -0.465 0.657 0.575 a
#> 9 0.0513 -0.380 -0.509 1.08 b
#> 10 0.856 0.153 0.388 -1.76 a
#> # ℹ 990 more rows
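The `upsample_cols` and `augment` arguments documented above are not exercised by the examples so far. The following sketch shows one way they might be combined, assuming the tidytof package is installed; the object name `cluster_only` and the choice of `cd45` and `cd38` as upsampling columns are illustrative.

```r
library(tidytof)

# simulate single-cell data and clustered reference data, as above
sim_data <-
  dplyr::tibble(
    cd45 = rnorm(n = 1000),
    cd38 = rnorm(n = 1000),
    cd34 = rnorm(n = 1000),
    cd19 = rnorm(n = 1000)
  )
reference_data <-
  dplyr::tibble(
    cd45 = rnorm(n = 200),
    cd38 = rnorm(n = 200),
    cd34 = rnorm(n = 200),
    cd19 = rnorm(n = 200),
    cluster_id = c(rep("a", times = 100), rep("b", times = 100))
  )

# upsample using only two markers and return just the cluster ids
cluster_only <-
  tof_upsample(
    tof_tibble = sim_data,
    reference_tibble = reference_data,
    reference_cluster_col = cluster_id,
    upsample_cols = c(cd45, cd38),
    augment = FALSE,
    method = "distance"
  )
```

Per the Value section, `cluster_only` should contain a single column with one row per cell in `sim_data`, which is convenient when the cluster ids will be joined onto another table.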