Upsample cells into the cluster of their nearest neighbor in a reference dataset
Source: R/upsample.R
This function performs upsampling on CyTOF data by sorting single cells (passed into the function as `tof_tibble`) into their most phenotypically similar cell subpopulation in a reference dataset (passed into the function as `reference_tibble`). For each cell in `tof_tibble`, it finds the nearest neighbor in `reference_tibble` and assigns the cell to the cluster to which that neighbor belongs. The nearest-neighbor calculation can be performed with either Euclidean or cosine distance.
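The assignment rule described above can be sketched in base R. This is a minimal illustration for the single-nearest-neighbor, Euclidean case only; `nearest_cluster()` and its argument names are hypothetical and are not part of tidytof, whose internal implementation may differ.

```r
# Hypothetical helper (not a tidytof function): assign each query cell to the
# cluster of its single nearest reference cell, using Euclidean distance.
nearest_cluster <- function(query, reference, reference_labels) {
  query <- as.matrix(query)
  reference <- as.matrix(reference)
  # Squared Euclidean distance between every query row and every reference row,
  # expanded as ||q||^2 + ||r||^2 - 2 * (q . r)
  sq_dist <-
    outer(rowSums(query^2), rowSums(reference^2), "+") -
    2 * query %*% t(reference)
  # Index of the closest reference cell for each query cell
  nearest <- apply(sq_dist, 1, which.min)
  reference_labels[nearest]
}

# Two well-separated toy clusters in two dimensions
set.seed(2023)
reference <- rbind(
  matrix(rnorm(40, mean = -5), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
labels <- rep(c("a", "b"), each = 20)
query <- rbind(c(-5, -5), c(5, 5), c(-4, -6))

assigned <- nearest_cluster(query, reference, labels)
assigned
#> [1] "a" "b" "a"
```

Because the sketch builds the full query-by-reference distance matrix, it is only practical for small inputs.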
Arguments
- tof_tibble
A `tibble` or `tof_tbl` containing cells to be upsampled into their nearest reference subpopulation.
- reference_tibble
A `tibble` or `tof_tbl` containing cells that have already been clustered or manually gated into subpopulations.
- reference_cluster_col
An unquoted column name indicating which column in `reference_tibble` contains the subpopulation label (or cluster id) for each cell in `reference_tibble`.
- upsample_cols
Unquoted column names indicating which columns in `tof_tibble` to use in computing the distances used for upsampling. Defaults to all numeric columns in `tof_tibble`. Supports tidyselect helpers.
- num_neighbors
An integer indicating how many neighbors should be used in the nearest neighbor calculation. Clusters are assigned based on majority vote.
- distance_function
A string indicating which distance function should be used to perform the upsampling. Options are "euclidean" (the default) and "cosine".
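To illustrate how `num_neighbors` and `distance_function` interact, the following base R sketch computes cosine distances and then takes a majority vote among the closest reference cells. The helpers `cosine_distance()` and `majority_vote()` are hypothetical names introduced here, and the tie-breaking rule is an assumption; none of this is taken from tidytof's source.

```r
# Hypothetical helpers for illustration only; not tidytof internals.

# Cosine distance (1 - cosine similarity) between rows of two matrices
cosine_distance <- function(query, reference) {
  query <- as.matrix(query)
  reference <- as.matrix(reference)
  # Scale each row to unit length so the dot product is the cosine similarity
  query_unit <- query / sqrt(rowSums(query^2))
  reference_unit <- reference / sqrt(rowSums(reference^2))
  1 - query_unit %*% t(reference_unit)
}

# Majority vote among the num_neighbors closest reference cells
majority_vote <- function(distances, reference_labels, num_neighbors = 1) {
  apply(distances, 1, function(d) {
    neighbors <- order(d)[seq_len(num_neighbors)]
    votes <- table(reference_labels[neighbors])
    # Assumption: ties go to the label that sorts first
    names(votes)[which.max(votes)]
  })
}

reference <- rbind(c(1, 0), c(2, 0.1), c(0, 1), c(0.1, 3), c(3, 0.2))
labels <- c("a", "a", "b", "b", "a")
query <- rbind(c(10, 1), c(0.5, 4))

distances <- cosine_distance(query, reference)
voted <- majority_vote(distances, labels, num_neighbors = 3)
voted
#> [1] "a" "b"
```

Cosine distance compares only the direction of each cell's marker profile, so the first query cell matches cluster "a" even though its magnitude is much larger than that of any reference cell.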
Value
A tibble with one column named `.upsample_cluster`, a character vector of length `nrow(tof_tibble)` indicating the id of the reference cluster to which each cell (i.e. each row) in `tof_tibble` was assigned.
Examples
# simulate single-cell data (and reference data with clusters to upsample
# into)
sim_data <-
  dplyr::tibble(
    cd45 = rnorm(n = 1000),
    cd38 = rnorm(n = 1000),
    cd34 = rnorm(n = 1000),
    cd19 = rnorm(n = 1000)
  )
reference_data <-
  dplyr::tibble(
    cd45 = rnorm(n = 200),
    cd38 = rnorm(n = 200),
    cd34 = rnorm(n = 200),
    cd19 = rnorm(n = 200),
    cluster_id = c(rep("a", times = 100), rep("b", times = 100))
  )
# upsample using euclidean distance
tof_upsample_neighbor(
  tof_tibble = sim_data,
  reference_tibble = reference_data,
  reference_cluster_col = cluster_id
)
#> # A tibble: 1,000 × 1
#>    .upsample_cluster
#>    <chr>
#>  1 a
#>  2 b
#>  3 a
#>  4 b
#>  5 a
#>  6 b
#>  7 b
#>  8 b
#>  9 b
#> 10 a
#> # ℹ 990 more rows
# upsample using cosine distance
tof_upsample_neighbor(
  tof_tibble = sim_data,
  reference_tibble = reference_data,
  reference_cluster_col = cluster_id,
  distance_function = "cosine"
)
#> # A tibble: 1,000 × 1
#>    .upsample_cluster
#>    <chr>
#>  1 a
#>  2 b
#>  3 a
#>  4 b
#>  5 b
#>  6 b
#>  7 b
#>  8 b
#>  9 b
#> 10 a
#> # ℹ 990 more rows