Perform developmental clustering on high-dimensional cytometry data.

This function performs distance-based clustering on high-dimensional cytometry data by assigning each cancer cell (passed into the function as `tof_tibble`) to its most phenotypically similar healthy cell subpopulation (passed into the function as `healthy_tibble`). For details about the algorithm used to perform the clustering, see this paper.
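
The idea of assigning each cell to its nearest healthy subpopulation can be sketched in base R. This is a minimal illustration, not tidytof's implementation: the data, the marker names, and the use of per-subpopulation means and covariances with `stats::mahalanobis()` are all assumptions made for the example.

```r
# Illustrative base-R sketch of nearest-centroid classification with
# Mahalanobis distance. This is NOT tidytof's implementation.
set.seed(1)

# Hypothetical healthy reference: two well-separated subpopulations
healthy <- data.frame(
    cd45 = c(rnorm(100, mean = 0), rnorm(100, mean = 5)),
    cd34 = c(rnorm(100, mean = 0), rnorm(100, mean = 5)),
    cluster_id = rep(c("a", "b"), each = 100)
)

# Hypothetical cells to classify, one close to each subpopulation
cells <- data.frame(cd45 = c(0.1, 4.8), cd34 = c(-0.2, 5.3))

markers <- c("cd45", "cd34")

# One column of squared Mahalanobis distances per healthy subpopulation
distances <- sapply(
    split(healthy[, markers], healthy$cluster_id),
    function(pop) {
        mahalanobis(as.matrix(cells), center = colMeans(pop), cov = cov(pop))
    }
)

# Assign each cell to the subpopulation with the smallest distance
assigned <- colnames(distances)[apply(distances, 1, which.min)]
assigned
```

Each row of `distances` corresponds to one cell and each column to one healthy subpopulation, which mirrors the kind of output described under `return_distances` below.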

Usage

tof_cluster_ddpr(
  tof_tibble,
  healthy_tibble,
  healthy_label_col,
  cluster_cols = where(tof_is_numeric),
  distance_function = c("mahalanobis", "cosine", "pearson"),
  num_cores = 1L,
  parallel_cols,
  return_distances = FALSE,
  verbose = FALSE
)

Arguments

tof_tibble: A `tibble` or `tof_tbl` containing cells to be classified into their nearest healthy subpopulation (generally cancer cells).
healthy_tibble: A `tibble` or `tof_tbl` containing cells from only healthy control samples (i.e. not disease samples).
healthy_label_col: An unquoted column name indicating which column in `healthy_tibble` contains the subpopulation label (or cluster id) for each cell in `healthy_tibble`.
cluster_cols: Unquoted column names indicating which columns in `tof_tibble` to use in computing the DDPR clusters. Defaults to all numeric columns in `tof_tibble`. Supports tidyselect helpers.
distance_function: A string indicating which distance function should be used to perform the classification. Options are "mahalanobis" (the default), "cosine", and "pearson".
num_cores: An integer indicating the number of CPU cores used to parallelize the classification. Defaults to 1 (a single core).
parallel_cols: Optional. Unquoted column names indicating which columns in `tof_tibble` to use for breaking up the data in order to parallelize the classification using `foreach` on a `doParallel` backend. Supports tidyselect helpers.
return_distances: A boolean value indicating whether the returned result should include only one column, the cluster ids corresponding to each row of `tof_tibble` (`return_distances = FALSE`, the default), or should also include additional columns representing the distance between each row of `tof_tibble` and each of the healthy subpopulation centroids (`return_distances = TRUE`).
verbose: A boolean value indicating whether progress updates should be printed during developmental classification. Default is FALSE.
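
To give a feel for how the "cosine" and "pearson" options differ from "mahalanobis", the sketch below computes both for one cell and one centroid in base R. It assumes the common convention that the distance is 1 minus the corresponding similarity; tidytof's internal definitions may differ, and the values here are hypothetical.

```r
# Illustrative definitions only; not taken from tidytof's source.
# Hypothetical cell and centroid (the centroid is exactly 2 * cell).
cell <- c(cd45 = 2.0, cd38 = 0.5, cd34 = 1.0, cd19 = 3.0)
centroid <- c(cd45 = 4.0, cd38 = 1.0, cd34 = 2.0, cd19 = 6.0)

# Cosine distance: 1 minus the cosine of the angle between the vectors
cosine_distance <-
    1 - sum(cell * centroid) / (sqrt(sum(cell^2)) * sqrt(sum(centroid^2)))

# Pearson distance: 1 minus the Pearson correlation across markers
pearson_distance <- 1 - cor(cell, centroid)

c(cosine = cosine_distance, pearson = pearson_distance)
```

Both values are zero (up to floating-point error) because the centroid is an exact multiple of the cell: under these definitions, cosine and Pearson distances compare the shape of the marker profile and ignore its overall magnitude.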

Value

If `return_distances = FALSE`, a tibble with one column named `.{distance_function}_cluster`, a character vector of length `nrow(tof_tibble)` indicating the id of the developmental cluster to which each cell (i.e. each row) in `tof_tibble` was assigned.

If `return_distances = TRUE`, a tibble with `nrow(tof_tibble)` rows and `nrow(classifier_fit) + 1` columns, where `nrow(classifier_fit)` is the number of healthy subpopulations. Each row represents a cell from `tof_tibble`, and `nrow(classifier_fit)` of the columns represent the distance between the cell and each of the healthy subpopulations' cluster centroids. The final column represents the cluster id of the healthy subpopulation with the minimum distance to the cell represented by that row.

Examples

sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000)
    )

healthy_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 200),
        cd38 = rnorm(n = 200),
        cd34 = rnorm(n = 200),
        cd19 = rnorm(n = 200),
        cluster_id = c(rep("a", times = 100), rep("b", times = 100))
    )

tof_cluster_ddpr(
    tof_tibble = sim_data,
    healthy_tibble = healthy_data,
    healthy_label_col = cluster_id
)
#> # A tibble: 1,000 × 1
#>    .mahalanobis_cluster
#>    <chr>               
#>  1 b                   
#>  2 b                   
#>  3 b                   
#>  4 a                   
#>  5 b                   
#>  6 b                   
#>  7 b                   
#>  8 a                   
#>  9 a                   
#> 10 b                   
#> # ℹ 990 more rows
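
A further sketch, using only the arguments documented above, requests the per-subpopulation distances and a non-default distance function. The seed and the use of `dplyr::starts_with()` to select marker columns are choices made for this illustration; the exact names of the distance columns are not shown because they are not specified on this page.

```r
library(tidytof)

set.seed(2020)

# Simulated cells to classify
sim_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 1000),
        cd38 = rnorm(n = 1000),
        cd34 = rnorm(n = 1000),
        cd19 = rnorm(n = 1000)
    )

# Simulated healthy reference with two labeled subpopulations
healthy_data <-
    dplyr::tibble(
        cd45 = rnorm(n = 200),
        cd38 = rnorm(n = 200),
        cd34 = rnorm(n = 200),
        cd19 = rnorm(n = 200),
        cluster_id = c(rep("a", times = 100), rep("b", times = 100))
    )

# Return distances to every healthy centroid plus the assigned cluster id
ddpr_distances <-
    tof_cluster_ddpr(
        tof_tibble = sim_data,
        healthy_tibble = healthy_data,
        healthy_label_col = cluster_id,
        cluster_cols = dplyr::starts_with("cd"),
        distance_function = "cosine",
        return_distances = TRUE
    )

dim(ddpr_distances)
```

Per the Value section, the result should have one row per cell in `sim_data` and one column per healthy subpopulation plus one for the assigned cluster id.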
