Upsample cells into the cluster of their nearest neighbor in a reference dataset

This function performs upsampling on CyTOF data by sorting single cells (passed into the function as `tof_tibble`) into their most phenotypically similar cell subpopulation in a reference dataset (passed into the function as `reference_tibble`). For each cell in `tof_tibble`, it finds the nearest neighbor in `reference_tibble` and assigns the cell to the cluster to which that neighbor belongs. The nearest neighbor calculation can be performed with either euclidean (the default) or cosine distance.
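The assignment rule described above can be illustrated with a small base R sketch. This is not the package's implementation; `nearest_cluster()` is a hypothetical helper written only to show the idea, and it assumes euclidean distance and a single nearest neighbor.

```r
# Hypothetical helper (not part of tidytof): give each query cell the
# cluster label of its single nearest reference cell.
nearest_cluster <- function(query, reference, labels) {
    query <- as.matrix(query)
    reference <- as.matrix(reference)
    vapply(
        seq_len(nrow(query)),
        function(i) {
            # squared euclidean distance from cell i to every reference cell
            diffs <- sweep(reference, 2, query[i, ])
            d2 <- rowSums(diffs^2)
            # label of the closest reference cell
            labels[which.min(d2)]
        },
        character(1)
    )
}

# two reference cells, one per cluster
reference <- rbind(c(0, 0), c(10, 10))
labels <- c("a", "b")

# three cells to upsample
query <- rbind(c(1, -1), c(9, 12), c(0.5, 0.2))

nearest_cluster(query, reference, labels)
#> [1] "a" "b" "a"
```

A brute-force scan like this compares every cell against every reference cell, so it is only meant to make the logic concrete on toy data.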

Usage

tof_upsample_neighbor(
  tof_tibble,
  reference_tibble,
  reference_cluster_col,
  upsample_cols = where(tof_is_numeric),
  num_neighbors = 1L,
  distance_function = c("euclidean", "cosine", "l2", "ip")
)

Arguments

tof_tibble: A `tibble` or `tof_tbl` containing cells to be upsampled into their nearest reference subpopulation.
reference_tibble: A `tibble` or `tof_tbl` containing cells that have already been clustered or manually gated into subpopulations.
reference_cluster_col: An unquoted column name indicating which column in `reference_tibble` contains the subpopulation label (or cluster id) for each cell in `reference_tibble`.
upsample_cols: Unquoted column names indicating which columns in `tof_tibble` to use in computing the distances used for upsampling. Defaults to all numeric columns in `tof_tibble`. Supports tidyselect helpers.
num_neighbors: An integer indicating how many neighbors should be used in the nearest neighbor calculation. Defaults to 1. When more than one neighbor is used, each cell is assigned to the cluster held by the majority of its neighbors.
distance_function: A string indicating which distance function should be used to perform the upsampling. Options are "euclidean" (the default) and "cosine". The usage signature also lists "l2" and "ip", which are not described here.
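To make the interplay between `num_neighbors` and `distance_function` concrete, here is a minimal base R sketch of a k-nearest-neighbor majority vote under either distance. The helpers `euclidean_distance()`, `cosine_distance()` and `knn_vote()` are hypothetical and not part of the package. Tie-breaking (here, the first label in alphabetical order) is an assumption of the sketch, since the documentation does not say how ties are resolved.

```r
# Hypothetical helpers (not part of tidytof).
euclidean_distance <- function(x, y) sqrt(sum((x - y)^2))

cosine_distance <- function(x, y) {
    1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Assign one cell to the majority cluster among its k nearest reference cells.
knn_vote <- function(cell, reference, labels, k = 1L,
                     dist_fun = euclidean_distance) {
    # distance from the cell to every reference cell (one per row)
    d <- apply(reference, 1, function(ref_cell) dist_fun(cell, ref_cell))
    # labels of the k closest reference cells
    neighbor_labels <- labels[order(d)[seq_len(k)]]
    # majority vote; ties go to the first label alphabetically (assumption)
    votes <- table(neighbor_labels)
    names(votes)[which.max(votes)]
}

reference <- rbind(c(3, 0), c(10, 1), c(1, 2), c(2, 3), c(0, 3))
labels <- c("a", "a", "b", "b", "b")
cell <- c(3, 1)

knn_vote(cell, reference, labels, k = 1L)
#> [1] "a"
knn_vote(cell, reference, labels, k = 3L)
#> [1] "b"
knn_vote(cell, reference, labels, k = 3L, dist_fun = cosine_distance)
#> [1] "a"
```

On this toy data the single nearest neighbor and the three-neighbor vote disagree under euclidean distance, and switching to cosine distance changes the three-neighbor result again, because cosine distance compares the direction of the marker vectors rather than their magnitude.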

Value

A tibble with one column named `.upsample_cluster`, a character vector of length `nrow(tof_tibble)` indicating the id of the reference cluster to which each cell (i.e. each row) in `tof_tibble` was assigned.

Examples


# simulate single-cell data (and reference data with clusters to upsample
# into)
sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000)
    )

reference_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 200),
        cd38 = rnorm(n = 200),
        cd34 = rnorm(n = 200),
        cd19 = rnorm(n = 200),
        cluster_id = c(rep("a", times = 100), rep("b", times = 100))
    )

# upsample using euclidean distance
tof_upsample_neighbor(
    tof_tibble = sim_data,
    reference_tibble = reference_data,
    reference_cluster_col = cluster_id
)
#> # A tibble: 1,000 × 1
#>    .upsample_cluster
#>    <chr>            
#>  1 a                
#>  2 b                
#>  3 a                
#>  4 b                
#>  5 a                
#>  6 b                
#>  7 b                
#>  8 b                
#>  9 b                
#> 10 a                
#> # ℹ 990 more rows

# upsample using cosine distance
tof_upsample_neighbor(
    tof_tibble = sim_data,
    reference_tibble = reference_data,
    reference_cluster_col = cluster_id,
    distance_function = "cosine"
)
#> # A tibble: 1,000 × 1
#>    .upsample_cluster
#>    <chr>            
#>  1 a                
#>  2 b                
#>  3 a                
#>  4 b                
#>  5 b                
#>  6 b                
#>  7 b                
#>  8 b                
#>  9 b                
#> 10 a                
#> # ℹ 990 more rows