Perform developmental clustering on high-dimensional cytometry data.
Source: R/clustering.R
tof_cluster_ddpr.Rd
This function performs distance-based clustering on high-dimensional cytometry data by sorting cancer cells (passed into the function as `tof_tibble`) into their most phenotypically similar healthy cell subpopulation (passed into the function using `healthy_tibble`). For details about the algorithm used to perform the clustering, see this paper.
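The idea of sorting each cell into its nearest healthy subpopulation can be illustrated with a minimal base R sketch. This is not the package's internal code: the simulated data, the marker names, and the choice to use each subpopulation's own covariance matrix are assumptions made only for illustration.

```r
# Illustrative sketch only; not tidytof's internal implementation.
set.seed(1)

# Two hypothetical healthy subpopulations with clearly different means.
healthy <- data.frame(
  cd45 = c(rnorm(100, mean = 0), rnorm(100, mean = 3)),
  cd38 = c(rnorm(100, mean = 0), rnorm(100, mean = 3)),
  cluster_id = rep(c("a", "b"), each = 100)
)
markers <- c("cd45", "cd38")

# Cells to classify (standing in for cancer cells).
cells <- data.frame(cd45 = c(0.1, 2.9), cd38 = c(-0.2, 3.2))

# For each healthy subpopulation, compute the squared Mahalanobis distance
# from every cell to that subpopulation's centroid.
distances <- sapply(
  split(healthy[markers], healthy$cluster_id),
  function(pop) {
    stats::mahalanobis(
      as.matrix(cells),
      center = colMeans(pop),
      cov = stats::cov(pop)
    )
  }
)

# Assign each cell to the subpopulation with the smallest distance.
assigned <- colnames(distances)[apply(distances, 1, which.min)]
assigned
```

Here `distances` has one row per cell and one column per healthy subpopulation, which mirrors the shape described under Value for `return_distances = TRUE`.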
Arguments
- tof_tibble
A `tibble` or `tof_tbl` containing cells to be classified into their nearest healthy subpopulation (generally cancer cells).
- healthy_tibble
A `tibble` or `tof_tbl` containing cells from only healthy control samples (i.e. not disease samples).
- healthy_label_col
An unquoted column name indicating which column in `healthy_tibble` contains the subpopulation label (or cluster id) for each cell in `healthy_tibble`.
- cluster_cols
Unquoted column names indicating which columns in `tof_tibble` to use in computing the DDPR clusters. Defaults to all numeric columns in `tof_tibble`. Supports tidyselect helpers.
- distance_function
A string indicating which distance function should be used to perform the classification. Options are "mahalanobis" (the default), "cosine", and "pearson".
- num_cores
An integer indicating the number of CPU cores used to parallelize the classification. Defaults to 1 (a single core).
- parallel_cols
Optional. Unquoted column names indicating which columns in `tof_tibble` to use for breaking up the data in order to parallelize the classification using `foreach` on a `doParallel` backend. Supports tidyselect helpers.
- return_distances
A boolean value indicating whether the returned result should include only one column, the cluster ids corresponding to each row of `tof_tibble` (`return_distances = FALSE`, the default), or should also include columns representing the distance between each row of `tof_tibble` and each of the healthy subpopulation centroids (`return_distances = TRUE`).
- verbose
A boolean value indicating whether progress updates should be printed during developmental classification. Default is FALSE.
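To make the non-default `distance_function` options more concrete, the sketch below computes a cosine and a Pearson distance between one cell and one subpopulation centroid in base R. The helper names are hypothetical, and defining each distance as one minus the corresponding similarity is an assumption; the package may define them differently.

```r
# Hypothetical helpers; definitions assumed to be 1 - similarity.
cosine_distance <- function(x, centroid) {
  1 - sum(x * centroid) / (sqrt(sum(x^2)) * sqrt(sum(centroid^2)))
}

pearson_distance <- function(x, centroid) {
  1 - stats::cor(x, centroid)
}

# A cell whose marker profile is an exact multiple of the centroid.
cell <- c(cd45 = 1, cd38 = 2, cd34 = 3)
centroid <- c(cd45 = 2, cd38 = 4, cd34 = 6)

cosine_distance(cell, centroid)   # approximately 0: same direction
pearson_distance(cell, centroid)  # approximately 0: perfectly correlated
```

Both measures compare the shape of the marker profile rather than its magnitude, so a cell that is a scaled copy of a centroid sits at distance (approximately) zero from it.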
Value
If `return_distances = FALSE`, a tibble with one column named `.{distance_function}_cluster`, a character vector of length `nrow(tof_tibble)` indicating the id of the developmental cluster to which each cell (i.e. each row) in `tof_tibble` was assigned.
If `return_distances = TRUE`, a tibble with `nrow(tof_tibble)` rows and `nrow(classifier_fit) + 1` columns, where `nrow(classifier_fit)` is the number of healthy subpopulations. Each row represents a cell from `tof_tibble`, and `nrow(classifier_fit)` of the columns represent the distance between the cell and each of the healthy subpopulations' cluster centroids. The final column represents the cluster id of the healthy subpopulation with the minimum distance to the cell represented by that row.
See also
Other clustering functions: `tof_cluster()`, `tof_cluster_flowsom()`, `tof_cluster_kmeans()`, `tof_cluster_phenograph()`
Examples
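library(tidytof)

# simulated cells to classify (standing in for cancer cells)
sim_data <-
  dplyr::tibble(
    cd45 = rnorm(n = 1000),
    cd38 = rnorm(n = 1000),
    cd34 = rnorm(n = 1000),
    cd19 = rnorm(n = 1000)
  )

# simulated healthy cells with two labeled subpopulations, "a" and "b"
healthy_data <-
  dplyr::tibble(
    cd45 = rnorm(n = 200),
    cd38 = rnorm(n = 200),
    cd34 = rnorm(n = 200),
    cd19 = rnorm(n = 200),
    cluster_id = c(rep("a", times = 100), rep("b", times = 100))
  )

# assign each simulated cell to its nearest healthy subpopulation
tof_cluster_ddpr(
  tof_tibble = sim_data,
  healthy_tibble = healthy_data,
  healthy_label_col = cluster_id
)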
#> # A tibble: 1,000 × 1
#> .mahalanobis_cluster
#> <chr>
#> 1 b
#> 2 b
#> 3 b
#> 4 a
#> 5 b
#> 6 b
#> 7 b
#> 8 a
#> 9 a
#> 10 b
#> # ℹ 990 more rows
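A second call, sketched here using only the arguments documented above, requests the per-subpopulation distances alongside the assigned cluster and switches to the cosine distance. The data are simulated as in the example above, the call is skipped if tidytof is not installed, and no output is shown because it depends on the random draws.

```r
# Sketch: same simulated setup, but returning distances as well as cluster ids.
if (requireNamespace("tidytof", quietly = TRUE)) {
  sim_data <-
    dplyr::tibble(
      cd45 = rnorm(n = 1000),
      cd38 = rnorm(n = 1000),
      cd34 = rnorm(n = 1000),
      cd19 = rnorm(n = 1000)
    )

  healthy_data <-
    dplyr::tibble(
      cd45 = rnorm(n = 200),
      cd38 = rnorm(n = 200),
      cd34 = rnorm(n = 200),
      cd19 = rnorm(n = 200),
      cluster_id = c(rep("a", times = 100), rep("b", times = 100))
    )

  # One row per cell; distance columns plus the assigned cluster id.
  result <-
    tidytof::tof_cluster_ddpr(
      tof_tibble = sim_data,
      healthy_tibble = healthy_data,
      healthy_label_col = cluster_id,
      distance_function = "cosine",
      return_distances = TRUE
    )

  print(result)
}
```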